Abstract


This report describes an approach for the assessment of
upset resilience that is applicable to systems in general, 
including safety-critical, real-time systems. For this work, 
resilience is defined as the ability to preserve and restore service 
availability and integrity under stated conditions of
configuration, functional inputs and environmental conditions.
To enable a quantitative approach, we define novel system 
service degradation metrics and propose a new mathematical 
definition of resilience. These behavioral-level metrics are 
based on the fundamental service classification criteria of
correctness, detectability, symmetry and persistence. This
approach consists of a Monte-Carlo-based stimulus injection
experiment, on a physical implementation or an
error-propagation model of a system, to generate a system response
set that can be characterized in terms of dimensional error 
metrics and integrated to form an overall measure of resilience. 
We expect this approach to be helpful in gaining insight into the 
error containment and repair capabilities of systems for a wide 
range of conditions. 
1. Introduction


A research effort is underway to develop practical validation and verification (V&V) methods that can 
enable rigorous safety assurance for the next generation of aviation systems. These systems are 
characterized by highly complex, large-scale, network-based distributed architectures with
software-implemented functionality and advanced computation and communication capabilities. To meet the
safety goals, these systems must be demonstrably robust with respect to system design and 
implementation errors, component degradations and failures, and partial system failures. The V&V 
challenge is compounded by strong coupling of system components in the software and the hardware, as 
well as the need to consider unexpected and possibly malicious component behaviors [26].

To support this research effort, verification approaches are being developed for robust distributed 
algorithms that support system redundancy management in a fault space with a wide range of severity. A 
system architecture for safety-critical real-time applications must have the ability to mitigate the effects of 
internal component faults of varying severity [18]. A safety-critical system must have sufficient design 
fault tolerance to accommodate the more frequent uncorrelated random faults without malfunctions at the 
system services. A robust system must also mitigate infrequent but more severe correlated faults that can 
exceed the system design assumptions, disrupt internal coordinated operation among the system 
components and propagate effects outward to the external service interfaces. Analysis techniques will be 
developed for system designs intended to ensure continued safe operation in the presence of component 
misbehavior while simultaneously minimizing their adverse effects. These techniques should enable 
designs with strongly assured safety properties under the weakest possible (i.e., least restrictive) 
assumptions in terms of the number and types of faults a system can handle. 

In this research context, a physical fault injection experiment was conducted in which a prototype 
implementation of an onboard data network for distributed safety-critical, real-time Integrated Modular 
Architectures (IMA) was exposed to a High Intensity Radiated Field (HIRF) environment in a
mode-stirred electromagnetic reverberation chamber [54, 87, 88, 90, 91]. The purpose of the experiment was to
gain insight into the response of the system to a wide range of internal faults, including conditions that 
exceed the design safety margins. There is special interest in examining the response to functional 
upsets, which are error modes that involve no permanent component damage, can simultaneously occur 
in multiple channels of a redundant distributed system and can cause unrecoverable distributed state error 
conditions [9, 29, 38]. 

The fault injection experiment was divided into two parts. The HIRF Susceptibility Threshold
Characterization (HSTC) experiment was intended to identify and examine factors that determine the 
measured minimum HIRF field strength level at which a particular electronic System Under Test (SUT) 
begins to experience HIRF-induced interference to its internal operation (i.e., faults). The results and 
lessons learned in the execution of the HSTC experiment are described in report [88]. The HIRF Effects 
Characterization (HEC) experiment was intended to assess the system response to functional system 
upsets. Different system configurations were tested with variations on the communication data rate, the 
degree of redundancy, and the number of simultaneously irradiated components. The objective was to 
characterize the effect of a HIRF environment on the behavior of the system and its components. The 
characterization will consider the effects at the external system interfaces and at the interfaces of internal 
components. Of special interest is determining the severity of component faults and assessing the 
robustness of the system to multiple simultaneous faults. We would like to identify weaknesses in the 
design of the system and desirable features for more robust communication systems. The test results are 
expected to contribute to the development of redundancy management mechanisms and policies for 
robust processing architectures. 
This report describes the approach to assess the HEC-experiment fault effects at the interfaces of the
system and its components. We also expect that the approach will help us gain insight into the relation 
between the severity of internal faults and the propagated effects. However, a thorough understanding of 
that relation is outside the scope of this assessment approach and will be the subject of future work using 
error-propagation system models to perform simulated fault injection experiments. 

The characterization of fault effects is based on the concept of resilience. In [55], Laprie defines 
resilience as “the persistence of service delivery that can justifiably be trusted, when facing changes.” In 
[59], Leveson states that resilience is often defined as “the ability to continue operations or recover stable 
state after a major mishap or event”. We are interested in an objective and quantitative characterization of 
fault effects. For this, we define a set of metrics to measure various dimensions of error manifestations. 
We also define composite metrics that integrate the error dimensions. These metrics are described in 
detail in subsequent sections. 

This report is organized as follows. The next section reviews background concepts that will be used 
later in this report. This is followed by a description of the approach for the assessment of resilience. 
Severity metrics for faults and their effects are presented after that. This report concludes with a 
summary of accomplishments and an overview of the plan to analyze the data collected in the HEC 
experiment. The appendix has sketches of proofs showing that the newly defined error metrics satisfy the 
required mathematical properties. 


2. Background Concepts 

This section is a review of concepts relevant to the presentation in later sections. It covers the 
definition of a system, threats to achieving dependable service, the desired attributes for dependable 
systems, a brief description of the design process, and an overview of the concept of a metric, including 
the required mathematical properties. 

2.1. System 

For our purpose, a system is an entity that consists of an arrangement of components and interacts 
with its environment (i.e., other entities and the natural physical world) at external interfaces to perform a 
specific function [4, 42]. The environment defines the boundaries of the system. A system can be
specified at various levels of abstraction (i.e., with varying amount of detail) in three domains: 
behavioral (i.e., in terms of the input-output response, without reference to implementation), structural 
(i.e., in terms of an interconnection of more primitive functional components, without reference to the 
external system-level function), and physical (i.e., in terms of physical components and physical 
characteristics, without reference to functionality). A system viewed as a black box is described in the 
behavioral domain in terms of inputs and outputs and the relation between them. In a white-box view of 
a system, the internal functional structure is visible and the system can be described in terms of the 
interaction among the components. The structural description of a system is recursive in the sense that 
each component is itself a system with its own function and structure. The recursion stops when a level is 
reached at which it is not possible, or of interest, to further decompose a component, and thus the 
component can be thought of as atomic or primitive (i.e., as a black box). 

A distributed computation system consists of a set of processing elements (or nodes) interconnected 
by a communication network [50]. The nodes communicate with each other by sending messages over 
the network to exchange data and coordinate their actions in order to achieve a common functional goal.
A system is said to be synchronous if it performs its function within finite and known time bounds;
otherwise, the system is said to be asynchronous [42]. This concept applies both to computation and
communication systems. 

The actions of a system are triggered by the events (i.e., changes in state [50]) of internal and external 
signals. A time-triggered system is triggered (or driven) by the progression of a clock that generates a 
sequence of events equally spaced in time (i.e., the events are periodic and constitute a measure of 
elapsed physical time). The most stringent time-related requirements for a computer system originate in 
applications involving the control of physical systems such as engines, airplanes, or industrial plants. 
These applications require sampling the state of the controlled process at regular intervals, followed by 
the computation and application of suitable control commands with strict timing constraints. A 
distributed time-triggered system requires a clock synchronization mechanism to establish a common 
time base at the processing elements, which can then be used as the foundation for coordinated action to 
deliver the required system-level function [50, 86, 93].

The behavior of a system is its sequence of outputs in time [50]. The service delivered by a system is 
the behavior as perceived by its user [4]. A user is a system that receives the service. In [72, 73], Powell
defines a service as a sequence of service items, each characterized by a value (or content) and a time of 
observation. A service item is correct if the existence of the item was actually specified and its value and 
time are within the specified set of allowed values and time interval for the service item. In general, the 
specification of a service item depends on the history of inputs to the system. The correctness of a service 
can be accurately judged by an omniscient observer that has complete knowledge of the sequence of 
service items that should be delivered according to the specification. A real observer may have to rely 
on incomplete knowledge to derive expected (or acceptable) value and time sets for the service items. 

A system may deliver a single service to multiple users. In this case, the service is defined as a 
sequence of replicated service items [72, 73]. This type of multi-user (or broadcast) service consists of 
a sequence of broadcast service items delivered to a set of users, with each broadcast service item 
consisting of a set of single-user (or simplex) service items, with one simplex item per user. A broadcast 
service item is correct if all the simplex service items are correct and there is consistency (i.e., agreement 
or symmetry) between every pair of simplex items. In the value domain, two types of agreement are 
possible: exact or approximate (i.e., inexact). Two values are in exact agreement if they are exactly 
equal. Approximate agreement is defined with respect to a specified error bound, such that two values 
are in approximate agreement if the difference between them is smaller than or equal to the error bound. 
In the continuous physical time domain, only the concept of approximate agreement is defined. Two 
simplex service items are in agreement if their value and time elements are in agreement. The selection 
of either exact or approximate value agreement in the specification of a broadcast item is dependent on 
the nature of the service being delivered. 
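
The correctness and agreement conditions above can be sketched in code. This is a minimal illustration, not part of the assessment approach itself: the names ServiceItem, item_is_correct, in_agreement and broadcast_is_correct are hypothetical, the allowed value set is assumed to be a closed interval, and the existence of each item is assumed to have been specified.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ServiceItem:
    value: float  # content of the service item
    time: float   # time of observation


def item_is_correct(item, allowed_values, time_window):
    """A simplex item is correct if its value and time fall within the specified ranges."""
    low_value, high_value = allowed_values
    earliest, latest = time_window
    return low_value <= item.value <= high_value and earliest <= item.time <= latest


def in_agreement(a, b, value_bound, time_bound):
    """Two items agree if their values and times differ by no more than the error bounds.

    value_bound = 0 requests exact value agreement; time agreement is always approximate.
    """
    return abs(a.value - b.value) <= value_bound and abs(a.time - b.time) <= time_bound


def broadcast_is_correct(items, allowed_values, time_window, value_bound, time_bound):
    """A broadcast item is correct if every simplex item is correct and every pair agrees."""
    all_correct = all(item_is_correct(i, allowed_values, time_window) for i in items)
    consistent = all(
        in_agreement(a, b, value_bound, time_bound) for a, b in combinations(items, 2)
    )
    return all_correct and consistent
```

For example, two simplex items with values 1.0 and 1.05 form a correct broadcast item under an approximate value bound of 0.1, but not under exact value agreement (a bound of 0).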

2.2. Defects and Failures 

Next we review service failure terminology, categories and models. 

2.2.1. Fault, Error, Failure 

The service delivered by a system can be either correct or incorrect. A service failure event is a 
transition from correct to incorrect service [4]. A service restoration event is the transition from
incorrect to correct service. A service outage is an interruption in correct service delivery lasting from
the time of the failure to the time of service restoration. For the purpose of this report, it is useful to
model a service outage as an error burst, which is an incorrect subsequence of service items. The
service is correct and said to be available before and after an error burst. 
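
As a rough illustration of the error-burst model, the following sketch extracts bursts from a recorded sequence of per-item correctness flags. It assumes the guard-band definition from the report's note on error bursts (a burst starts and ends with an incorrect item and contains no run of g or more correct items). The function name find_error_bursts and the boolean encoding are hypothetical choices made for this example.

```python
def find_error_bursts(correct, g):
    """Return inclusive (start, end) index pairs of error bursts.

    correct: sequence of booleans, True meaning the service item is correct.
    g: guard band; a run of g or more correct items separates two bursts.
    """
    if g < 1:
        raise ValueError("guard band g must be at least 1")
    bursts = []
    start = None     # index of the first incorrect item of the open burst
    last_bad = None  # index of the most recent incorrect item
    for i, ok in enumerate(correct):
        if not ok:
            if start is None:
                start = i
            last_bad = i
        elif start is not None and i - last_bad >= g:
            # g consecutive correct items have followed the last incorrect one
            bursts.append((start, last_bad))
            start = None
    if start is not None:
        # The record ended before the guard band was observed; report the burst anyway
        bursts.append((start, last_bad))
    return bursts
```

A larger guard band merges nearby incorrect items into a single outage, which is why the value of g depends on the purpose of the failure analysis.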


The terms fault, error, and failure are used to describe a cause-and-effect relationship between 
undesired circumstances in the context of the hierarchical composition of a system. Failure is assessed at 
the external interface of a system and is determined by deviations from the behavior expected according 
to the specification. An error is a deviation from the intended value and/or timing of data somewhere in 
a system. A fault is a defect in a system component that is the cause of errors. A fault in a system 
corresponds to a failure of a component. The fault, error, and failure terms facilitate the structured 
analysis of the failure characteristics of a system and the determination of failure causality chains from 
low-level components to higher-level components. In a simple chain, the failure of a system is due to the 
presence of errors in it, which are caused by one or more faulty components that failed to deliver the 
intended service. At this point, a faulty component can be seen as a failed system and the failure causality 
chain can be expanded by further exploring the hierarchical structure. The chain ends when a component 
is reached beyond which no internal structure can be discerned or is of interest [4].

2.2.2. Fault Classification 

Faults can be classified according to a multitude of criteria. Avizienis et al. [4] proposed a fault 
taxonomy based on the following classification criteria: phase of creation (either development or 
operations), system boundaries (internal or external defect), phenomenological cause (natural or human- 
made), dimension (hardware or software defect), objective (malicious or non-malicious), intent 
(deliberate or non-deliberate), capability (accidental or incompetence), and persistence (permanent or 
transient). Suri et al. [85] proposed the following fault classification criteria: activity (either latent or 
active, i.e., generating errors), duration (permanent or transient), perception (symmetric or asymmetric, as 
manifested at the service users), cause (random or generic, i.e., systemic), intent (benign or malicious, i.e., 
detectable or not by the users), count (single or multiple), time of multiple faults (coincident or distinct), 
and cause of multiple faults (independent or common mode, i.e., same or different causes). 

For characterizing the resilience of a system, we prefer fault classification criteria more suitable to the 
analysis of system effects. In general, a system is a recursive composition of (sub-)systems, and the main 
purpose of a system (as well as a sub-system) is to deliver a service to its users, which are other systems.
The output service of a system is the input service of another. Thus, we focus on fault classification 
criteria that characterize the service delivered by a component. Our preferred service classification 
criteria are the following. 

■ Correctness: Whether the service delivered is correct or incorrect. 

■ Inline detectability: Whether a user can independently detect incorrect service. If a user can detect 
input service errors using inline acceptance checks (e.g., coding, timing and reasonableness checks 
[42, 94]), then it may be able to take appropriate actions to prevent the propagation of the errors to 
its own computation. 


Note on error bursts (Section 2.2.1): More precisely, an error burst is a sequence of service items in which the first and last are incorrect and there is
not a subsequence of g or more correct service items within the burst [32]. An outage is preceded and followed by
sequences of at least g correct service items. The parameter g is the guard band of the burst, and its value may vary
depending on the purpose and specifics of the service failure analysis performed.
■ Symmetry: Qualitatively, a service failure can be symmetric (i.e., all users receive the same
service) or asymmetric (i.e., not all users receive the same service). Quantitatively, we can count, 
for example, how many pairs of users observe the same failure manifestations and how many pairs 
disagree on their observations. 

■ Persistence: Qualitatively, a service failure can be permanent or transient. Quantitatively, the 
failure persistence has a specific duration. 

2.2.3. Hybrid Fault Models 

For proper coordinated action, the processing elements of a distributed system must have a consistent 
view of the system state and the results of distributed computations. The processing elements use 
distributed protocols to achieve and preserve valid agreement on the state and data [42, 61]. The 
Omissive-Transmissive Hybrid (OTH) fault model [5, 6, 89, 90, 93] defines a fault classification suitable
for the analysis of distributed agreement protocols. The OTH fault categories are defined as follows for a 
broadcast service item. 

■ Correct Symmetric (CS): All users accept the same correct simplex service item. 

■ Omissive Symmetric (OS): All users reject the service item. 

■ Transmissive Symmetric (TS): All users accept the same incorrect simplex service item. 

■ Strictly Omissive Asymmetric (SOA): Some users accept the same correct simplex service item 
and others reject the item. 

■ Single-Data Omissive Asymmetric (SDOA): Some users accept the same incorrect simplex 
service item and others reject the item. 

■ Transmissive Asymmetric (TA): The users have other patterns of disagreeing simplex service 
items. 

A user rejects a service item if the item is not received at all or the received item is detectably
incorrect based on input error detection checks; otherwise, the user accepts a received service item. In 
the OTH model, an incorrect item is omissive if it is detectable by the input acceptance checks at the user; 
otherwise, the item is transmissive. Alternative equivalent terms for omissive and transmissive incorrect 
items are detectable and undetectable by input acceptance checks, respectively. 

Notice that a set of omissive items is symmetric irrespective of the actual content or timing of the 
items because symmetry in this case is assessed based on input unacceptability (i.e., error detectability). 
Also, notice that the SOA category divides the users into two subgroups, one CS and the other OS, so 
each is symmetric on its own. Likewise, the SDOA category divides the users into TS and OS subgroups. 
Additionally, notice that a broadcast item is TA only if there is at least one pair of users that accepts 
disagreeing input items, which can be either one correct and one transmissive, or two transmissive items. 
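
The OTH categories can be read as a decision procedure over the per-user observations of one broadcast service item. The sketch below is only an illustration under stated assumptions: each user's observation is encoded either as REJECT or as a (value, is_correct) pair, and accepted items are compared with exact agreement. The encoding and the name classify_oth are hypothetical.

```python
REJECT = None  # the user rejected the item (not received, or detectably incorrect)


def classify_oth(observations):
    """Classify one broadcast item into an OTH category.

    observations: one entry per user, either REJECT or a (value, is_correct) tuple
    for an accepted simplex item. Exact agreement is assumed between accepted items.
    """
    if not observations:
        raise ValueError("at least one user observation is required")
    accepted = [obs for obs in observations if obs is not REJECT]
    some_rejected = len(accepted) < len(observations)
    if not accepted:
        return "OS"   # all users reject the item
    if len(set(accepted)) > 1:
        return "TA"   # at least one pair of users accepts disagreeing items
    _, is_correct = accepted[0]
    if is_correct:
        return "SOA" if some_rejected else "CS"
    return "SDOA" if some_rejected else "TS"
```

Because every observation pattern reaches exactly one return statement, the sketch mirrors the property that the OTH categories are mutually exclusive and cover all patterns.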

The OTH fault model is complete in the sense that it covers all possible error patterns for a broadcast 
service item. Furthermore, the fault categories are mutually exclusive and form a partition of the set of 
possible error patterns (i.e., the manifestations of any given service item fall under exactly one of the 
OTH categories). Note that other fault-space partitions not based on the OTH model might be better 
suited for a particular analysis being performed.

A service outage consisting of multiple broadcast service items may have error manifestations under 
one or more OTH categories. Often, a service outage is classified based on the worst-case error 
manifestation. With respect to error severity, an incorrect item is considered to be worse than a correct 
item, an undetectable incorrect item is worse than a detectable one, and an asymmetric item is worse than 
a symmetric one. Miner et al. [61] used the following classification for a hybrid fault model. 

■ Good: The service is correct. 

■ Benign: The service items are either correct or omissive symmetric. 

■ Symmetric: The service items may be arbitrary (i.e., correct, omissive or transmissive) but all users 
receive the same service. 

■ Asymmetric: The service items may be arbitrary and asymmetric. 

This classification has a failure semantics (i.e., failure mode) [16] order of increasingly severe 
behavior from Good to Asymmetric, and forms a constrained-behavior hierarchy such that an Asymmetric 
service includes Symmetric service, which in turn includes Benign, which includes Good service. 
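
A worst-case classification of an outage can be sketched as a maximum over a severity order. The mapping from OTH categories to the four hybrid classes shown here is an assumption made for illustration (in particular, treating SOA and SDOA as Asymmetric); it is not taken from Miner et al. [61]. The names OTH_TO_HYBRID, SEVERITY and classify_service are hypothetical.

```python
# Severity order of the hybrid classes, from least to most severe
SEVERITY = {"Good": 0, "Benign": 1, "Symmetric": 2, "Asymmetric": 3}

# Assumed mapping from the OTH category of each broadcast item to a hybrid class
OTH_TO_HYBRID = {
    "CS": "Good",
    "OS": "Benign",
    "TS": "Symmetric",
    "SOA": "Asymmetric",
    "SDOA": "Asymmetric",
    "TA": "Asymmetric",
}


def classify_service(oth_categories):
    """Return the hybrid class of a service as the worst class among its items."""
    worst = "Good"
    for category in oth_categories:
        hybrid = OTH_TO_HYBRID[category]
        if SEVERITY[hybrid] > SEVERITY[worst]:
            worst = hybrid
    return worst
```

The maximum reflects the constrained-behavior hierarchy: a service classified as Symmetric may still contain Good and Benign items, but nothing more severe.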

2.3. Attributes 

Next, we consider a series of system qualities that are related to and provide a context for the concept 
of resilience in a safety-critical real-time system. Here the qualities of a system are defined in terms of 
the service it delivers. 

2.3.1. Real-Time 

In a real-time (i.e., time-critical) system service, the correctness of the service is determined not only 
by the value of the service items, but also the time of delivery [50]. A hard real-time service must
always deliver service items within the specified time interval, as there may be severe consequences on 
the users if this constraint is violated. A soft real-time service may fail to deliver service items within the 
specified time constraint, but the utility of the item decreases when the constraint is violated [85]. Some 
systems have firm real-time service requirements in which infrequent timing constraint violations are 
tolerable but may degrade the quality of the service. Some systems may be firm real-time with respect to 
the quality of the service, but hard real-time with respect to safety. For these systems, the quality of the 
service degrades as the update delay increases beyond the firm timing constraint until the hard real-time 
constraint is reached, at which point safety is compromised. This hard real-time delay threshold 
corresponds to the time-to-criticality of a system, which is the time interval between the occurrence of a 
failure and the user or environment reaching an unsafe state. For example, Paulitsch et al. [69] and 
Pimentel [71] reference a design requirement of 50 ms maximum service outage duration for an 
automobile steer-by-wire system. 


Based on the definition at http://www.faa.gov/library/manuals/aviation/risk_management/ss_handbook/media/appj_1200.pdf.
Accessed September 2, 2012.
2.3.2. Reliability


Reliability refers to the uninterrupted delivery of correct service [4]. Reliability is measured as the 
probability that correct service will continue for a time interval of specified duration and under stated 
conditions [50, 59]. The conditions can be physical environmental conditions (e.g., temperature, 
vibration, etc.) in the case of a physical system, and include the configuration of the system, the functional
input patterns and possibly the types and number of faults experienced by the system. 

2.3.3. Recoverability 

Recoverability is the ability to restore correct service delivery after experiencing a failure. Here we 
use the term recoverability to refer to the ability of a system to restore service on the fly. This falls under 
the larger context of maintainability, which includes physical replacement and repair of system 
components. For our purpose, recoverability is the complement of reliability and is measured as the 
probability that the service is restored within a specified time interval after the occurrence of a failure. 
Suri et al. [85, p. 5] ranked fault-handling strategies by their best achievable recovery-delay performance. 
In order of decreasing minimum recovery delay, these strategies are: diagnosis and reconfiguration 
policies, active and passive replication policies, and fault masking policies. 

2.3.4. Availability 

In this report, we use the term availability to refer to the fraction of time that the delivered service is 
correct. The availability of a service is a function of the reliability and recoverability. Availability is 
highest when both reliability and recoverability are high, as in this case the service remains correct for 
long time intervals and is quickly restored after a failure occurs. 
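
As a small worked example of this definition, availability over an observation window can be computed from the recorded outage intervals. The function names are hypothetical, and the second function uses the standard steady-state approximation MTTF / (MTTF + MTTR), which is an assumption added here to show how reliability (long time to failure) and recoverability (short time to restore) combine; the report does not state this formula.

```python
def availability(outages, observation_time):
    """Fraction of the observation window during which the service was correct.

    outages: non-overlapping (failure_time, restoration_time) pairs inside the window.
    """
    if observation_time <= 0:
        raise ValueError("observation_time must be positive")
    downtime = sum(restored - failed for failed, restored in outages)
    return 1.0 - downtime / observation_time


def steady_state_availability(mean_time_to_failure, mean_time_to_restore):
    """Textbook long-run availability from mean time to failure and mean time to restore."""
    return mean_time_to_failure / (mean_time_to_failure + mean_time_to_restore)
```

For instance, two outages lasting 2 and 3 time units in a window of 100 units give an availability of 0.95.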

2.3.5. Integrity 

In a general sense, integrity is related to the concept of truthfulness. Avizienis et al. [4] defined 
integrity as the absence of improper system state alterations. A service item satisfies this condition when 
it is correct, or incorrect but detectable by the user. Integrity is violated when the user accepts an 
incorrect service item. In [69], Paulitsch et al. defined integrity as the probability of an undetected 
failure. Service integrity is an important quality as it is related to the likelihood that the effects of a fault 
in a system will propagate and corrupt other systems. 

For distributed systems, proper coordinated action depends on consistency of state. Integrity in a 
distributed system is violated when users expect to receive the same service but are actually delivered 
different services. 

2.3.6. Safety 

Avizienis et al. [4] define safety as the absence of catastrophic consequences on the user(s) and the
environment. In the IEC 61508 standard [41], safety is defined as freedom from an unacceptable 
combination of the probability of the occurrence of injury or damage to people, equipment or the 
environment, and the severity of the occurrence. Kopetz [50] defines safety as the probability that a 
system will survive for a given time interval without a critical failure mode (i.e., a failure mode that can 
lead to catastrophic consequences), and thus, in this sense, safety is reliability regarding critical failure 
modes. 
For a safety-critical real-time system, safety requirements can be stated in terms of system functional
quality attributes related to two basic service failure modes: loss of function (i.e., passive failure) and
malfunction (i.e., active failure) [99-102]. The system safety requirements can be expressed in terms of
the availability and integrity of the service delivered to the users. Unavailability of the service with 
preserved integrity is a passive failure that may not be catastrophic if service can be restored before the 
time-to-criticality. An integrity violation corresponds to an active failure and is considered (or assumed 
to be) immediately unsafe (i.e., catastrophic). The ability to suppress incorrect service items at the source 
and to detect incorrect service items at the users are critical determining factors of safety. 
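
The distinction between passive and active failures can be expressed as a simple rule over an observed outage. This is a sketch under the assumptions stated in the paragraph above (an integrity violation is treated as immediately unsafe, and a passive failure is tolerable only if service is restored before the time-to-criticality). The function name and the returned labels are hypothetical.

```python
def classify_outage(duration, integrity_preserved, time_to_criticality):
    """Classify a service outage with respect to safety.

    duration: outage duration, in the same time unit as time_to_criticality.
    integrity_preserved: False if any user accepted an incorrect service item.
    """
    if not integrity_preserved:
        return "active failure: unsafe"
    if duration < time_to_criticality:
        return "passive failure: tolerable"
    return "passive failure: unsafe"
```

Using the 50 ms steer-by-wire requirement cited earlier as the time-to-criticality, a 20 ms outage with preserved integrity would be tolerable, while an 80 ms outage would not.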

2.3.7. Robustness 

Siewiorek et al. [80] define robustness as the ability of a system to identify and handle errors (at the
functional inputs or internal to the system) of varying severity in a consistent and predictable manner.
Kopetz in [50] states that a system is robust if the severity of the consequences of a fault is inversely
proportional to the probability of fault occurrence (i.e., frequent faults have less severe effects on service
quality than infrequent faults). Bishop et al. [11] consider a system to be robust if it resists a wide range
of attacks (i.e., faults) and operational conditions without significant service degradation, but may not 
have the ability to restore lost functionality (i.e., service quality). 

The main aspect of interest to us relative to system robustness is the assessment of service quality over 
a wide range of operational conditions (including number and severity of internal faults). Under these 
conditions, a service is robust if it satisfies Kopetz's criterion for robustness (i.e., severity of effects is
inversely proportional to the frequency of the fault). In general, robustness does not imply ability to
recover the service, but recovery may be essential for safety-critical real-time systems in order to satisfy
Kopetz's criterion.

2.3.8. Survivability 

A survivable system has the ability to continue to operate, possibly with highly degraded service, even 
under severe conditions. In essence, a survivable system may be easily degraded but nearly impossible to 
disable completely [11]. A survivable system may have the ability to effect some degree of recovery, but
this is not essential. 

2.3.9. Resilience 

A resilient system may experience degraded operation due to faults, but will eventually recover. In 
[55], Laprie defines resilience as “the persistence of service delivery that can justifiably be trusted, when 
facing changes.” Trivedi et al. [96] state that “resilience deals with conditions that are outside the design
envelope” and, in general, refers to the ability of a system to resist and recover from shock or strain. In
[59], Leveson states that resilience is often defined as “the ability to continue operations or recover stable
state after a major mishap or event”. Bishop et al. [11] state that “a resilient system is effectively a
survivable system that is capable of restoring not only its performance level back to desirable levels, but 
also the capacity of the system itself to recover, maintaining its ability to sustain future attacks or 
failures.” 

From our perspective, resilience describes the ability of a system to mitigate the effects of component- 
level service degradations. A system is resilient to faults if its service quality is hard to degrade and 
quality is restored after the fault condition has subsided. For safety-critical real-time systems, availability 
(in terms of reliability and recoverability) and integrity are the most significant measures of service
quality. Therefore, for safety-critical real-time systems, we define resilience as the ability to preserve and
restore service availability and integrity under stated conditions. 

2.4. Fault Hypothesis 

Depending on the application (e.g., commercial transport aircraft, manned military aircraft, 
autonomous vehicles, engine controls), the probability requirement for critical failure modes of safety- 
critical systems may be in the range of 10^-6 to 10^-10 per hour [50, 52, 53]. Given that electronic
components (e.g., chips, circuit boards, computer modules) have failure rates on the order of 10^-4 to 10^-6
per hour [50, 52] with failure modes that can be difficult or impossible to characterize [17], physical 
redundancy and suitable redundancy management mechanisms are necessary to meet system safety 
requirements regarding failure modes and rates. 

The first major step in the design of a fault-tolerant system is the specification of the fault hypothesis
(or fault assumptions) [50]. The fault hypothesis specifies a partition of the system into fault 
containment regions (FCR), which are components assumed to experience defects with a high degree of 
independence (in a probabilistic sense) [49, 52, 53, 65 - 67]. This implies that whatever causes a defect in 
an FCR is unlikely to coincidentally also cause a defect in another FCR, and that a defect in an FCR is 
unlikely to cause a defect in another FCR (i.e., a fault cascade). Thus, the FCRs are the basic units of 
failure in a system [50]. The fault hypothesis states the expected FCR failure modes and rates, as well as
the maximum number of simultaneously failed FCRs that the system may experience in operation [50, 72, 
73, 85]. (Note that the fault hypothesis is a way of specifying part of the “stated conditions” in the 
definition of system reliability, with other parts being the functional inputs and the configuration of the 
system.) 

The system is designed to meet the functional and service quality requirements while handling the 
fault space defined by the fault hypothesis. The operational fault-handling effectiveness of a system 
depends on two basic factors: the fault-assumption coverage (i.e., the probability that actually occurring 
faults are within the assumed fault space) and the fault-handling coverage (i.e., the probability that 
assumed faults are properly handled by the system) [4, 72, 73]. Thus, the design development is an 
iterative optimization process involving refinements to the fault hypothesis and the fault handling strategy 
and mechanisms. Usually, the fault modes and rates of non-redundant, primitive system components are 
fixed as determined by the implementation technology. The component fault rate is a determining factor 
in the amount of redundancy needed to satisfy the system service availability requirement, and the fault 
modes (i.e., failure semantics) influence the amount and organization of redundancy to satisfy the 
integrity requirement [5, 6, 52, 72, 73]. In general, to satisfy particular availability and integrity 
requirements, higher fault rates necessitate increased redundancy, and less constrained fault modes 
demand increased redundancy and more complex organization. However, using redundancy in hardware, 
software, time and/or information domains [43], it is possible to define higher-level structural components 
with less severe (i.e., safer) failure-mode rate profiles [16]. This approach increases the complexity of 
these higher-level components in exchange for easier-to-handle failure modes, which enables a simpler 
high-level system design. Examples of this include Honeywell’s SAFEbus with self-checking-pair 
components [17], Airbus’ Command-Monitor (COM-MON) computers [13, 94, 95], and the Boeing 777 
primary flight computers with triple internal redundancy in a command-monitor-standby configuration 
[99 - 102]. The design of the B777 flight computers shows that it is possible to constrain the failure- 
mode rates of high-level components while preserving their availability. 

The fault hypothesis divides the fault space into normal and rare fault regions (or subsets) based on 
assumed FCR failure rates [50]. (These regions are also labeled, respectively, expected and unexpected,
or credible and non-credible.) A primary system design goal is to achieve fault-handling coverage as
close as possible to 100% for normal faults. To mitigate the risk of a fault assumption violation, the 
system fault-handling design should also cover a significant percentage of the most likely fault scenarios 
in the rare-fault region. Ideally, a robust system should have fault-handling coverage that is proportional 
to the probability of occurrence of fault scenarios. An example of a fault-handling robustness approach is 
the never-give-up operational strategy of the Boeing 777 primary flight control computer (PFC), which
required a high probability that the PFC would continue operating as long as there were known good 
resources and that it would recover from temporary failures [85, p. 12]. According to Kopetz [49], in a 
properly designed system, a likely scenario for a fault-hypothesis violation is a transient correlated failure 
of multiple FCRs. For such scenarios, a robust safety-critical real-time system should recover to an 
operational state with high probability. The ability of a system to recover from any arbitrary state is 
called self-stabilization [1, 3, 33, 34].

2.5. Measurements and Metrics 

Measurement is the process of determining the amount (i.e., degree, size or extent) of some property 
present in an entity. There are four basic scales of measurement [83]: nominal, ordinal, interval, and 
ratio. A nominal scale defines categories (or classes) of the property of interest in terms of exemplars 
and/or descriptions of membership in a category. A measurement on a nominal scale simply assigns a 
category by determining equality with members of the category. An ordinal scale adds a ranking 
relationship between nominal categories such that it is now possible to compare the amount of the 
property of interest in any two categories and determine which is greater. An interval scale defines the 
difference (or “distance”) between any two entities in their amounts of the property of interest. This 
requires the definition of a unit of measurement, which is an accepted or standard amount such that any 
quantity of the property of interest can be expressed as a multiple of it. In an interval scale, the definition 
of the zero amount point is arbitrary or by convention, and therefore, the ratio between numbers on an 
interval scale is meaningless. A ratio scale introduces the concept of an absolute zero. A ratio scale is 
the kind commonly used in physics and engineering, and requires the definition of the four relations: 
equality, rank, difference, and ratio. 

A metric is a mathematical function that defines, for every pair of elements in a set, how far apart the 
elements are from each other (i.e., the distance between the elements) [36]. Thus, a metric is defined on 
an interval scale with a suitable unit of measurement. Let x, y, and z denote elements in a set S, and let d 
denote a function defined on the set S. Function d is a metric if it satisfies the following properties. 


Non-negativity:

$$d(x, y) \ge 0$$

Symmetry:

$$d(x, y) = d(y, x)$$

Identity of Indiscernibles:

$$d(x, y) = 0 \text{ if and only if } x = y$$

Triangle Inequality:

$$d(x, z) \le d(x, y) + d(y, z)$$
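The four properties can be checked mechanically on a small finite set. The sketch below is illustrative only: the helper name `is_metric` and the choice of 3-bit words as the test set are assumptions made for this example, not part of the report.

```python
from itertools import product

def is_metric(points, d):
    """Check the four metric properties of d on a finite set of points."""
    for x, y, z in product(points, repeat=3):
        if d(x, y) < 0:                    # non-negativity
            return False
        if d(x, y) != d(y, x):             # symmetry
            return False
        if (d(x, y) == 0) != (x == y):     # identity of indiscernibles
            return False
        if d(x, z) > d(x, y) + d(y, z):    # triangle inequality
            return False
    return True

def hamming(x, y):
    """Number of positions in which two equal-length words disagree."""
    return sum(a != b for a, b in zip(x, y))

def squared_hamming(x, y):
    """A function that is NOT a metric: squaring breaks the triangle inequality."""
    return hamming(x, y) ** 2

words = list(product([0, 1], repeat=3))    # all 3-bit words
print(is_metric(words, hamming))           # True
print(is_metric(words, squared_hamming))   # False
```

The squared variant fails because, for example, the squared distance from 000 to 111 is 9, while the path through 001 costs only 1 + 4 = 5.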


3. Resilience Assessment Approach 

We have defined resilience in safety-critical real-time systems as a measure of the ability to preserve 
and restore service availability and integrity under stated conditions. The statement of conditions
specifies the system configuration and the functional inputs, as well as the threats to the delivery of proper
system service. These threats may be described as fault conditions occurring internal to the system, or as
external environmental conditions (e.g., HIRF, lightning, high-energy particle radiation, power system
transients) that may cause faults in the system. In what follows, we treat the configuration and
functional inputs as given, and we focus on the relation between the threat conditions and the quality of 
delivered services. 

The system can then be viewed from a stimulus-response (i.e., cause and effect) perspective (see 
Figures 1 and 2). The threat conditions specify the stimulus space, which is a subset of all possible 
system threat patterns. A disturbance is an external system stimulus that may cause a perturbation, 
defined here as an internal fault condition in the form of outages on the services provided by the
components. Alternatively, we can skip the specification of the disturbance and specify the stimulus as a 
perturbation. The effects of a perturbation (i.e., errors) may propagate throughout the system and reach 
the external functional interface, thus causing an outage on the external system service, which we refer to 
as a disruption. The response space is the set of system disruptions resulting from the application of the 
stimulus space.

Figure 1: Stimulus-Response System Model

As Figure 2 suggests, for a given perturbation, the severity of the disruptions is determined by the 
error propagation characteristics of the system. Thus, from this perspective, resilience can be defined as 
the ability to contain the propagation of internal errors and to repair propagated errors. Note that these 
two aspects of resilience (i.e., containment and repair) correspond to the two key system attributes of 
integrity and availability.

Figure 2: Stimulus-Response Chain

For a quantitative system resilience analysis, the disturbance, perturbation and disruption spaces may 
be described in terms of probability distributions (PDs) of random variables whose values correspond to 
the severity of occurrences (i.e., items, events or instances) in these spaces. Figure 3 illustrates the use of 
PDs to describe the spaces in the stimulus-response chain. To enable the use of such distributions in 
analyses, we need to define severity metrics for the occurrences in each space. Report [87] offers a
simple occurrence severity model for a HIRF disturbance. For perturbations and disruptions, we use the
concept of corruption, which we define as the amount of error in a service outage. Sections 4 and 5 in 
this report present our proposed corruption metrics for disruptions and perturbations. Given a stimulus 
probability distribution, we can perform a stimulus injection experiment (e.g., a Monte Carlo experiment) 
to generate a response set, for which we can then compute the probability distribution. 
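As a minimal sketch of such a stimulus injection experiment, the code below samples perturbation severities from a given stimulus distribution, passes each sample through a placeholder system model, and tallies the resulting disruption severities. The function `inject`, its escape probability of 0.3, and the example stimulus distribution are hypothetical stand-ins for a physical implementation or an error-propagation model; they are not taken from the report.

```python
import random
from collections import Counter

def inject(perturbation_severity, rng):
    """Hypothetical error-propagation model (an assumption for illustration):
    each unit of perturbation severity independently escapes containment
    with probability 0.3 and adds one unit of disruption severity."""
    return sum(rng.random() < 0.3 for _ in range(perturbation_severity))

def response_distribution(stimulus_pd, trials, seed=0):
    """Monte Carlo estimate of the disruption-severity distribution."""
    rng = random.Random(seed)
    levels = list(stimulus_pd)
    weights = [stimulus_pd[s] for s in levels]
    counts = Counter()
    for _ in range(trials):
        stimulus = rng.choices(levels, weights=weights)[0]  # draw a stimulus
        counts[inject(stimulus, rng)] += 1                  # record the response
    return {sev: c / trials for sev, c in sorted(counts.items())}

# Assumed stimulus distribution: severity level -> probability.
stimulus_pd = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
response_pd = response_distribution(stimulus_pd, trials=10_000)
print(response_pd)
```

Fixing the seed makes the experiment repeatable, which helps when comparing different systems or the same system under different conditions.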


[Figure 3 panels: Disturbance Space, Perturbation Space, Disruption Space; axis label: Probability]

Figure 3: Description of spaces in the stimulus-response chain using probability distributions

If we have a PD-based description of a space, we can compute basic statistical measures like 
percentiles (e.g., minimum, median, maximum and quartiles) and averages (e.g., mean, standard deviation 
SD, and root-mean-square RMS) to characterize the distribution. Note that a PD-based description of the 
spaces in the system stimulus-response chain is, in effect, an abstract system description that is suitable 
for quantitative comparison of system quality attributes for different systems or the same system under 
different conditions. 

Our preferred statistical measure for these error or error-inducing spaces is the RMS value of the 
distribution. The RMS value of a space (i.e., a set) described by the probability distribution of the 
severity of occurrences is defined as follows. We consider the case of a discrete-valued severity range, 
but the description is similar for a continuous range, using an integral instead of a summation. Let $\lambda$ be
the index of the severity levels in the space, with $0 \le \lambda \le \lambda_{max}$, and let $s_\lambda$ and $p_\lambda$ denote the magnitude and
probability of the $\lambda$-th severity level. The RMS severity $S_{rms}$ is given by:

$$S_{rms} = \sqrt{\sum_{\lambda=0}^{\lambda_{max}} p_\lambda \, s_\lambda^2}$$

The RMS value has the interesting property that it can be expressed in terms of the mean $S_{mean}$ and
standard deviation $S_{SD}$ as follows:

$$S_{rms}^2 = S_{mean}^2 + S_{SD}^2$$
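The relation between the RMS value, the mean, and the standard deviation can be checked numerically. The sketch below represents a discrete severity distribution as a dictionary mapping severity magnitude to probability; that representation and the two example distributions (`narrow` and `wide`) are assumptions chosen for illustration.

```python
import math

def mean(pd):
    """Mean severity of a discrete distribution {severity: probability}."""
    return sum(p * s for s, p in pd.items())

def std_dev(pd):
    """Standard deviation of the severity distribution."""
    m = mean(pd)
    return math.sqrt(sum(p * (s - m) ** 2 for s, p in pd.items()))

def rms(pd):
    """Root-mean-square severity: sqrt of the probability-weighted squares."""
    return math.sqrt(sum(p * s ** 2 for s, p in pd.items()))

# Two assumed distributions with the same mean severity of 2.
narrow = {2: 1.0}          # all occurrences at severity 2
wide = {0: 0.5, 4: 0.5}    # occurrences split between severity 0 and 4

for name, pd in (("narrow", narrow), ("wide", wide)):
    print(name, mean(pd), std_dev(pd), rms(pd))

# Identity check: S_rms^2 = S_mean^2 + S_SD^2
print(math.isclose(rms(wide) ** 2, mean(wide) ** 2 + std_dev(wide) ** 2))
```

Both distributions have the same mean, but the one with the larger dispersion has the larger RMS value, since its standard deviation adds to the squared mean.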

The reason we prefer the RMS value to the mean is that if we compare two distributions with the same 
mean, the one with the larger dispersion as measured by the standard deviation has a larger proportion of
occurrences at higher severity levels. Intuitively, we consider that distribution to have a higher aggregate 
degradation severity, and we want our statistical measure to reflect that. The RMS value is a more 
conservative measure of the occurrence severity distribution than the mean. Of course, a single measure 
is usually not an adequate characterization of a distribution, and we should always consider the other 
statistical measures, including averages and percentiles.

Let $s_{\lambda_{max}}$ (= $s_{max}$) denote the highest disruption severity. We define the RMS severity of the disruption
space as the RMS corruptibility $Q_{rms}$ of a system for a given stimulus space. $Q_{rms}$ is an aggregate measure
of the deterioration in system service quality due to the disturbance or perturbation stimulus. That is the
opposite of resilience. Thus, we define the RMS resilience $R_{rms}$ as the complement of corruptibility:

$$R_{rms} + Q_{rms} = s_{\lambda_{max}} \tag{1}$$

We normalize the severity scale to simplify the interpretation of these measures. The normalized
RMS corruptibility and resilience, denoted $q_{rms}$ and $r_{rms}$, are given by:

$$q_{rms} = Q_{rms} / s_{\lambda_{max}} \tag{2}$$

$$r_{rms} = R_{rms} / s_{\lambda_{max}} = (s_{\lambda_{max}} - Q_{rms}) / s_{\lambda_{max}} \tag{3}$$

Equation (1) then becomes:

$$r_{rms} + q_{rms} = 1 \tag{4}$$
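A short sketch of Equations (1) to (4) follows. The function name `normalized_measures` and the example disruption distribution are assumptions for illustration; the distribution is again a dictionary mapping disruption severity to probability.

```python
import math

def normalized_measures(disruption_pd, s_max):
    """Return (q_rms, r_rms) for a disruption distribution {severity: probability}.

    s_max is the highest disruption severity on the scale.
    """
    # RMS corruptibility Q_rms: RMS severity of the disruption space.
    Q_rms = math.sqrt(sum(p * s ** 2 for s, p in disruption_pd.items()))
    # RMS resilience R_rms is the complement with respect to s_max (Equation 1).
    R_rms = s_max - Q_rms
    # Normalization by s_max (Equations 2 and 3).
    return Q_rms / s_max, R_rms / s_max

# Assumed example: most injections cause no disruption at all.
disruption_pd = {0: 0.90, 1: 0.06, 2: 0.04}
q_rms, r_rms = normalized_measures(disruption_pd, s_max=2)
print(q_rms, r_rms, q_rms + r_rms)
```

Because the RMS severity can never exceed the highest severity level, both normalized measures fall in the range from 0 to 1 and always sum to 1.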

The following sections describe our proposed disruption and perturbation severity metrics. 


4. Service Disruption Metrics 

We seek meaningful measures for the amount of error in a service disruption, which is an error burst 
consisting of a sequence of one or more service items, some of which are in error. We would like the 
error measures to be useful in error propagation analyses, especially for synchronous distributed systems, 
where the validity and agreement of the data and state are paramount [61, 93]. Based on our system 
analysis experience, the OTH fault model is defined at a suitable level of abstraction for service items [61, 
89, 93]. Thus, in accounting for service errors, we choose the service item as the lowest level of 
granularity and we abstract out the service item dimensions of value and time. Furthermore, for a single- 
user service, the OTH model describes a service item based on two criteria: correctness and detectability. 
For a multi-user service, the OTH model for a broadcast service item adds the dimension of symmetry for 
the classification of error patterns. To describe an error burst, we also need to consider the dimension of 
persistence (i.e., duration) of the burst. The selected strategy to measure the disruption error is to define a 
separate dimensional-error metric for each of these criteria and then properly combine the dimensional 
metrics to form a composite total-error metric. Note that depending on the purpose of a particular 
resilience analysis being performed, the disruption error may be measured in terms of one of the basic 
dimensional errors or some function of these. 

We assume a synchronous user model by which the timeline is partitioned into a complete set of 
mutually exclusive time intervals such that there is exactly one item expected for each interval. The range 
of each time interval coincides or extends beyond the correct time range of the corresponding service 
item. In the case of a multi-user service, all the users are assumed to have perfect mutually synchronized 
time intervals. This model accounts for errors in which the number of items a user receives in a time 
interval is fewer or greater than expected. In the assumed user model, such errors are allocated to the 
expected service item.

4.1. Single-User Service Item

Figure 4 illustrates the model for a single-user service. With respect to correctness, a delivered 
service item at the user can be either correct or incorrect. With respect to detectability by the input 
acceptance check, the item can be either detectable or undetectable. As shown in Figure 4, a service item
x can be expressed in terms of correctness $x_C$ and detectability $x_D$ “coordinates”, both of which are
Boolean variables. To quantify the values in each of these dimensions, we use the same basic idea as in
the Hamming distance [98], where the unit of measurement is a disagreement ($\ne$) between values. This
way the distance between correct and incorrect is assigned the numeric value of 1, and the distance
between detectable and undetectable is also 1. Let $d_C'(x, y)$ and $d_D'(x, y)$ denote the distance between
items x and y in the dimensions of correctness and detectability, respectively, assuming independence
between the dimensions. Table 1 shows the values of $d_C'$ and $d_D'$ for all possible combinations of
correctness and detectability.


[Figure 4 labels: Service, User]

Figure 4: Single-user service model with correctness and detectability dimensions


Table 1: Correctness and detectability distance functions for single-user service items assuming dimensional independence

In theory, the correctness and detectability dimensions of the model can be independent, but in
practice, a correct item should not be declared invalid by the input acceptance check as that effectively 
reduces the availability of the system. Here we assume that the acceptance check never invalidates a 
correct service item (i.e., no false-positives). However, it is not always possible to achieve perfect inline 
detectability, which means that some incorrect items may not be detectable (i.e., possible false-negatives). 
This is related to the trade-off between correctness and completeness in system diagnosis, where a 
correctness-biased diagnosis policy will not declare bad any good component but may have to allow some 
bad components to be declared good, whereas a completeness-biased diagnosis policy identifies all bad 
components but may also declare bad some components which are actually good [93, p. 27]. Therefore, 
in effect, our service model assumes a correctness policy for input acceptance checks. 

To measure the total service-item error for correctness and detectability, we need to consider the 
interaction between these dimensions taking into consideration the above discussion. In particular, a 
correct item must not be said to have an error in the detectability dimension. An incorrect item may be
either detectable or undetectable. Thus, there are only three outcomes for a service item: correct (denoted
C), detectable incorrect (D), and undetectable incorrect (U). Let $d_C(x, y)$ and $d_D(x, y)$ denote the distance
between items x and y in the dimensions of correctness and detectability, respectively, with dependence
between the dimensions. $d_{CD}(x, y)$ denotes the correctness-and-detectability distance between items x and
y. The value of $d_{CD}$ is the sum of the distances in the correctness and detectability dimensions (i.e., the
Manhattan distance).

$$d_{CD}(x, y) = d_C(x, y) + d_D(x, y) \tag{5}$$

Table 2 shows the values of $d_C$ and $d_D$ for all combinations of x and y. Note that in order to
accommodate the dependence between the dimensions, we set $d_D(C, D) = 0$ despite the fact that C is not
detectable while D represents a detectable item. This is justified by the fact that there is no misdetection
for either C or D.
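The dependent distance functions can be sketched by encoding each outcome as a pair of flags. The encoding below (first flag: incorrect, second flag: misdetected) is an assumption chosen to reproduce the stated convention $d_D(C, D) = 0$; the values it prints are derived from that encoding rather than copied from Table 2.

```python
from itertools import product

# Assumed encoding of the three outcomes as (incorrect?, misdetected?).
# C: correct; D: incorrect but detected; U: incorrect and not detected.
OUTCOMES = {"C": (0, 0), "D": (1, 0), "U": (1, 1)}

def d_c(x, y):
    """Correctness distance: 1 if x and y disagree on correctness."""
    return int(OUTCOMES[x][0] != OUTCOMES[y][0])

def d_d(x, y):
    """Detectability distance with dependence: 1 if exactly one is misdetected."""
    return int(OUTCOMES[x][1] != OUTCOMES[y][1])

def d_cd(x, y):
    """Correctness-and-detectability distance (Equation 5)."""
    return d_c(x, y) + d_d(x, y)

print("x y d_C d_D d_CD")
for x, y in product("CDU", repeat=2):
    print(x, y, d_c(x, y), d_d(x, y), d_cd(x, y))
```

Under this encoding, C and U are the farthest apart, since they differ in both dimensions.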

Table 2: Correctness and detectability distance functions for single-user service items with dimensional dependence 


Table 2 columns: x, y, $d_C(x, y)$, $d_D(x, y)$, $d_{CD}(x, y)$

To measure the total amount of correctness and detectability error, we define C as the zero-error
reference item. Let $e_C$ and $e_D$ denote the correctness and detectability dimensional errors, respectively,
which are defined as follows.

$$e_C(x) = d_C(x, C) \tag{6}$$

$$e_D(x) = d_D(x, C) \tag{7}$$

Then the total correctness-and-detectability error, denoted $e_{CD}$, is given by the following expression.

$$e_{CD}(x) = e_C(x) + e_D(x) = d_{CD}(x, C) \tag{8}$$

Table 3 breaks down the correctness and detectability error for a single-user service item. Notice that
the case of correct and detectable is assumed not to occur. Also, notice the correspondence between the 
combination of correctness and detectability and the attributes of availability and integrity. Specifically, 
an item is available if it is correct, and it has integrity if it is either correct or detectable incorrect. An 
undetectable incorrect item has a total error count of 2 because it is both incorrect and undetectable. 


Table 3: Correctness and detectability errors for a single-user service item 


Table 3 columns: $x_{CD}$, $x_C$, $x_D$, Availability, Integrity, $e_C$

From Table 3, it is easy to see that the following relations are valid. The proofs are given in Appendix A.

$$d_C(x, y) = |e_C(x) - e_C(y)| \tag{9}$$

$$d_D(x, y) = |e_D(x) - e_D(y)| \tag{10}$$

$$d_{CD}(x, y) = |e_C(x) - e_C(y)| + |e_D(x) - e_D(y)| \tag{11}$$

$$d_{CD}(x, y) = |e_{CD}(x) - e_{CD}(y)| \tag{12}$$
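Relations (11) and (12) can be checked exhaustively over the three outcomes. The error values in the sketch below follow the text (a detectable incorrect item has a correctness error only; an undetectable incorrect item has a total error of 2); the dictionary names are assumptions for illustration.

```python
from itertools import product

# Dimensional errors relative to the zero-error reference item C.
E_C = {"C": 0, "D": 1, "U": 1}   # correctness error
E_D = {"C": 0, "D": 0, "U": 1}   # detectability error

def e_cd(x):
    """Total correctness-and-detectability error (Equation 8)."""
    return E_C[x] + E_D[x]

def d_cd_by_dimension(x, y):
    """Right-hand side of Equation (11)."""
    return abs(E_C[x] - E_C[y]) + abs(E_D[x] - E_D[y])

def d_cd_by_total(x, y):
    """Right-hand side of Equation (12)."""
    return abs(e_cd(x) - e_cd(y))

relations_agree = all(
    d_cd_by_dimension(x, y) == d_cd_by_total(x, y)
    for x, y in product("CDU", repeat=2)
)
print(relations_agree)   # True
```

The two expressions coincide here because the three outcomes are ordered in both dimensions at once (C, then D, then U), so the dimensional differences never have opposite signs.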


4.2. Multiple-User Service Item 

For multi-user service, we must account for the error in the dimension of symmetry among the 
simplex items. The total error, then, includes the dimensions of correctness, detectability and symmetry. 

We begin by defining the error for the dimensions of correctness and detectability using the concepts 
in the definition of error for a single-user service item. We then define the symmetry error for 
approximate and exact agreement. Finally, we define the total error as a combination of the dimensional 
errors. 

4.2.1. Correctness and Detectability 

We use the letter n to denote the number of service users. A multi-user service item X is a vector of
simplex service items denoted by $x_i$ for $1 \le i \le n$. Thus:

$$X = (x_1, x_2, \ldots, x_n)$$

In the correctness and detectability dimensions, the distance between two multi-user items X and Y is 
given by the sum of the distances between respective elements. 


$$d_C(X, Y) = d_C(x_1, y_1) + \cdots + d_C(x_n, y_n) \tag{13}$$

$$d_D(X, Y) = d_D(x_1, y_1) + \cdots + d_D(x_n, y_n) \tag{14}$$


The total correctness-and-detectability (CD) distance is also given by the sum of the distances between 
the elements. 


$$d_{CD}(X, Y) = d_{CD}(x_1, y_1) + \cdots + d_{CD}(x_n, y_n) \tag{15}$$

The correctness and detectability dimensional errors of service item X are given by the sum of the 
errors of its elements: 


$$e_C(X) = e_C(x_1) + \cdots + e_C(x_n) \tag{16}$$

$$e_D(X) = e_D(x_1) + \cdots + e_D(x_n) \tag{17}$$

The total CD error is given by:

$$e_{CD}(X) = e_{CD}(x_1) + \cdots + e_{CD}(x_n) \tag{18}$$

A multi-user service item X in the correctness and detectability model can be expressed as the vector
$X_{CD} = (x_{1,CD}, x_{2,CD}, \ldots, x_{n,CD})$. Because the correctness-and-detectability error is a sum, which is
commutative, the order of the elements in $X_{CD}$ is not significant in the computation of $e_{CD}$. Based on this,
$X_{CD}$ can be expressed using the following compact notation, with b + p + w = n.

$$X_{CD} = C^b D^p U^w \tag{19}$$

Therefore, $e_{CD}(X)$ can also be expressed as follows.

$$e_{CD}(X) = [b \, e_{CD}(C)] + [p \, e_{CD}(D)] + [w \, e_{CD}(U)] = p + 2w \tag{20}$$
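The equivalence of the element-wise sum in Equation (18) and the compact form in Equation (20) can be sketched as follows. Representing a multi-user item as a string of outcome letters is an assumption made for this example.

```python
from collections import Counter

# Per-item total CD error for each outcome.
E_CD = {"C": 0, "D": 1, "U": 2}

def e_cd_sum(item):
    """Equation (18): sum of the CD errors of the simplex items."""
    return sum(E_CD[x] for x in item)

def e_cd_compact(item):
    """Equation (20): p + 2w, using only the counts of D and U items."""
    counts = Counter(item)
    p, w = counts["D"], counts["U"]
    return p + 2 * w

# Two orderings of the same multiset C^2 D^1 U^1 for n = 4 users.
for item in ("CDUC", "UCCD"):
    print(item, e_cd_sum(item), e_cd_compact(item))
```

Both orderings give the same error, which is why the compact notation can drop the element positions.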

4.2.2. Symmetry 

Approximate agreement and exact agreement have fundamentally different relational structures. In 
defining the symmetry error for a multi-user service item, we must consider these structures in defining 
the “amount of error”. Thus, we define separate error functions for each type of agreement. 

For both approximate and exact agreement we assume that all correct simplex items are in mutual 
agreement (i.e., any C is in agreement with every other C and $C^n$ is symmetric), all detectable incorrect
items mutually agree (i.e., any D is in agreement with every other D and $D^n$ is symmetric), and detectable
incorrect items are in disagreement with all other kinds of items (i.e., Ds do not agree with Cs nor Us). 

4.2.2.1. Approximate Agreement

Figure 5 illustrates the pair-wise approximate agreement relations for n = 4. In general, there are a
total of n(n - 1)/2 pair-wise (i.e., binary) relations that define the agreement pattern for an n-item set. In
Figure 5, a link represents agreement between the pair of connected items. The graph is fully connected 
for a symmetric set. An asymmetric set has one or more missing links. 


[Figure 5 panels: Symmetric, Asymmetric, Asymmetric]

Figure 5: Examples of symmetric and asymmetric approximate-agreement patterns for n = 4

However, the links in Figure 5 may not be independent. To see this, consider the concept of
approximate agreement for a set of n items in which each item is mapped to a point on a number line,
which can be either integer or real valued. The items can thus be sorted by increasing magnitude. A pair
of items is in agreement if their distance on the line is smaller than or equal to a specified bound, denoted
$\varepsilon$. Figure 6 illustrates the pair-wise agreement relations when the individual items are placed in order on a
number line. A link between nodes i and j is denoted $l_{i,j}$. The set of all possible links, denoted L, is given
by:

$$L = \{ l_{i,j} : 1 \le i \le n, \; 1 \le j \le n, \; i < j \}$$

This set is composed of all links between first (i.e., immediate) neighbors, all links between second
neighbors, and so on, with the last link being between the nodes at the extreme ends. On a number line, a
link $l_{i,j}$ is said to cover link $l_{k,m}$ if the segment from k to m is included in the segment from i to j. Notice
that if a link is missing (e.g., $l_{1,2}$), then all longer links that cover the missing link are also missing (i.e., $l_{1,3}$
and $l_{1,4}$). This corresponds to the property that, if items 1 and 2 disagree, then all items at or to the left of
1 disagree with all items at or to the right of 2. This property does not apply in the opposite direction
from longer links to covered shorter links. For example, if $l_{1,3}$ is missing, $l_{1,2}$ and $l_{2,3}$ may still be present.
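The cover property can be illustrated by computing the agreement links directly from item values on a number line. In the sketch below, the function names, the example values, and the bound `eps = 1.0` are all assumptions for illustration; nodes are numbered 1 to n after sorting by value.

```python
from itertools import combinations

def agreement_links(values, eps):
    """Return {(i, j): agrees} for 1-based node indices i < j.

    Nodes are numbered in order of increasing value; a link is present
    when the two values differ by at most eps.
    """
    ordered = sorted(values)
    return {
        (i + 1, j + 1): abs(ordered[j] - ordered[i]) <= eps
        for i, j in combinations(range(len(ordered)), 2)
    }

def covers(outer, inner):
    """True if the segment of link `outer` includes the segment of `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Assumed example: item 1 is far from the other three items.
links = agreement_links([0.0, 2.0, 2.5, 3.0], eps=1.0)
missing = [link for link, present in links.items() if not present]

# Every link that covers a missing link must itself be missing.
cover_property_holds = all(
    not links[outer]
    for inner in missing
    for outer in links
    if covers(outer, inner)
)
print(links)
print(cover_property_holds)   # True
```

Here the missing link between nodes 1 and 2 forces the links from node 1 to nodes 3 and 4 to be missing as well, while the links among nodes 2, 3 and 4 remain present.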

Considering these agreement graphs, we can think of two different approaches to measure the 
symmetry distance between two multi-user service items. One approach is to focus on the individual 
links as independent units of symmetry and define the symmetry distance between two multi-user items 
as the number of agreement links in which they differ. For example, patterns (b) and (c) in Figure 6 differ
in links $l_{1,2}$, $l_{1,3}$, $l_{3,4}$, and $l_{2,4}$, so the distance between the graphs is 4. The second approach is to focus on
the clusters of fully connected subsets of individual items and compare different multi-user items based 
on the relative size of their clusters. In Figure 6, graph (a) has a single 4-item cluster, and both graphs (b) 
and (c) have clusters of size 3 and 1. Comparing cluster sizes we can say that graph (a) has a greater 
degree of symmetry and graphs (b) and (c) have equal symmetry, but it is not obvious how to define the 
distance between the patterns. We address these two symmetry-distance approaches separately. 

[Figure 6 panels: (a) Symmetric, (b) Asymmetric, (c) Asymmetric; each panel shows nodes 1 to 4 on a number line]

Figure 6: Examples of approximate-agreement patterns on a number line for n = 4

4.2.2.1.1. Link-based Distance

We define $l_{i,j}$ to be a Boolean variable that is TRUE (T) when node i is in approximate agreement with
node j, and FALSE (F) otherwise. L can then be represented as a Boolean vector $L = (\ldots, l_{i,j}, \ldots)$. To
define the link-based symmetry distance between two multi-user service items X and Y, we again borrow
the basic idea used in the definition of the Hamming distance. We specify the unit of link-based
symmetry distance as a disagreement ($\ne$) between corresponding links in X and Y. Link $l_{i,j}$ in X and Y is
denoted $l_{X,i,j}$ and $l_{Y,i,j}$, respectively. $d_S(l_{X,i,j}, l_{Y,i,j})$ denotes the symmetry distance between items X and Y
with respect to link $l_{i,j}$, such that $d_S(l_{X,i,j}, l_{Y,i,j}) = 1$ if $l_{X,i,j} \ne l_{Y,i,j}$ and $d_S(l_{X,i,j}, l_{Y,i,j}) = 0$ if $l_{X,i,j} = l_{Y,i,j}$. The
symmetry distance between X and Y is given by:

$$d_S(X, Y) = \sum_{i,j} d_S(l_{X,i,j}, l_{Y,i,j}) \tag{21}$$

with $1 \le i \le n$, $1 \le j \le n$, $i < j$.

The total symmetry error $e_S$ of a multi-user service item X is given by the distance from a symmetric
item, represented here by $C^n$ for convenience.

$$e_S(X) = d_S(X, C^n) \tag{22}$$
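Equations (21) and (22) can be sketched by building the link vectors from cluster memberships and counting disagreements. The cluster assignments used below for patterns (b) and (c) are assumptions consistent with the description of Figure 6 (one 3-item cluster and one isolated item, differing in links $l_{1,2}$, $l_{1,3}$, $l_{3,4}$ and $l_{2,4}$); the function names are hypothetical.

```python
from itertools import combinations

def links_from_clusters(n, clusters):
    """Link (i, j) is present when i and j share a fully connected cluster."""
    return {
        (i, j): any(i in c and j in c for c in clusters)
        for i, j in combinations(range(1, n + 1), 2)
    }

def link_distance(links_x, links_y):
    """Equation (21): number of links on which the two patterns disagree."""
    return sum(links_x[k] != links_y[k] for k in links_x)

n = 4
symmetric = links_from_clusters(n, [{1, 2, 3, 4}])      # pattern (a)
pattern_b = links_from_clusters(n, [{1}, {2, 3, 4}])    # assumed pattern (b)
pattern_c = links_from_clusters(n, [{1, 2, 3}, {4}])    # assumed pattern (c)

print(link_distance(pattern_b, pattern_c))   # 4
# Equation (22): symmetry error is the distance from the symmetric pattern.
print(link_distance(pattern_b, symmetric))   # 3
print(link_distance(pattern_c, symmetric))   # 3
```

Patterns (b) and (c) have the same symmetry error even though they are 4 links apart, which shows that the link-based distance distinguishes patterns that the error value alone does not.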

4.2.2.1.2. Cluster-based Distance

The approximate agreement graph for a multi-user item can have up to n clusters. We use the symbol
$\alpha$ to denote the size of a cluster. Thus, for cluster-based symmetry, a multi-user service item X can be
represented by a vector of n cluster sizes:

$$X_S = (\alpha_1, \ldots, \alpha_n),$$

where the $\alpha_i$ elements are sorted by decreasing value such that $\alpha_1 \ge \alpha_2 \ge \ldots \ge \alpha_n$. Because some of the n
single-user items of the multi-user service item X may be in multiple clusters, the cluster sizes may vary
such that $\alpha_i \ge 0$ and $\alpha_1 + \ldots + \alpha_n \ge n$.


We have been unable to gain sufficient insight into this cluster-based representation to identify an
objective unit of measure to define a symmetry distance function. Therefore, we resort to defining an
ordinal scale based on ranking rules for vectors of cluster sizes. To do this, we introduce the concept of
dominance. Let $X_S = (\alpha_{X,1}, \ldots, \alpha_{X,n})$ and $Y_S = (\alpha_{Y,1}, \ldots, \alpha_{Y,n})$ be the cluster-size vector representations
of multi-user items X and Y. We say that $X_S$ dominates $Y_S$ if there is a vector-element index i, $1 \le i \le n$,
such that $\alpha_{X,i} > \alpha_{Y,i}$ and $\alpha_{X,j} = \alpha_{Y,j}$ for $1 \le j < i$. If $X_S$ dominates $Y_S$, we say that X has more symmetry
than Y, or equivalently, Y has larger symmetry error than X.
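The dominance rule amounts to a lexicographic comparison of the sorted cluster-size vectors. A minimal sketch, with the function name `dominates` chosen for illustration:

```python
def dominates(xs, ys):
    """True if cluster-size vector xs dominates ys.

    Scans from the first element; at the first index where the vectors
    differ, xs dominates when its element is the larger one.
    """
    for ax, ay in zip(xs, ys):
        if ax != ay:
            return ax > ay
    return False   # identical vectors: neither dominates

print(dominates((3, 1, 0, 0), (2, 2, 0, 0)))   # True
print(dominates((2, 2, 0, 0), (3, 1, 0, 0)))   # False
print(dominates((2, 1, 1, 0), (2, 1, 1, 0)))   # False
```

For vectors of equal length this agrees with Python's built-in tuple ordering, so `xs > ys` gives the same result.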

To define the cluster-based symmetry scale, we have to list all possible cluster vectors sorted by 
dominance, and then form a one-to-one mapping from the dominance scale to the natural numbers 
beginning with a value of 0 for the symmetric case, i.e., $X_S = (n, 0, \ldots, 0)$. Table 4 shows the cluster
vectors with n = 4 ranked by dominance and their assigned symmetry error values. The symmetry 
distance between items X and Y is given by their distance on the error scale. 

$$d_S(X, Y) = |e_S(X) - e_S(Y)| \tag{23}$$

Table 4: Cluster-based symmetry error scale for approximate agreement (n = 4)

Cluster Vector    Symmetry Error ($e_S$)
(4, 0, 0, 0)      0
(3, 3, 0, 0)      1
(3, 2, 0, 0)      2
(3, 1, 0, 0)      3
(2, 2, 2, 0)      4
(2, 2, 1, 0)      5
(2, 1, 1, 0)      6
(1, 1, 1, 1)      7


4.2.2.2. Exact Agreement 

We have not been able to identify an objective unit of measure on which to base a symmetry distance 
function for exact agreement. Therefore, we proceed with a similar approach as for cluster-based 
symmetry for approximate agreement. 

Again, a multi-user service item X can be represented by a vector of n cluster sizes:

$$X_S = (\alpha_1, \ldots, \alpha_n)$$

where the $\alpha_i$ elements are sorted by decreasing value such that $\alpha_1 \ge \alpha_2 \ge \ldots \ge \alpha_n$. For exact agreement,
the simplex items of a multi-user service item form a partition (i.e., a set of mutually exclusive clusters
including all the individual items), as now an item can belong to at most one cluster. The cluster sizes
under exact agreement may vary such that $\alpha_i \ge 0$ and $\alpha_1 + \ldots + \alpha_n = n$. Thus, in effect, the possible
cluster vectors for an n-user service item are given by the set of integer partitions of n [36]. Table 5 lists
the integer partitions for n = 4 ranked by dominance and their corresponding symmetry error values.
Equation (23) also applies here for the exact-agreement symmetry distance $d_S$ between multi-user service
items.


Table 5: Cluster-based symmetry error scale for exact agreement (n = 4) 


Cluster Vector    Symmetry Error ($e_S$)
(4, 0, 0, 0)      0
(3, 1, 0, 0)      1
(2, 2, 0, 0)      2
(2, 1, 1, 0)      3
(1, 1, 1, 1)      4
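The exact-agreement scale can be generated for any n by enumerating the integer partitions of n, padding them with zeros, ranking them by dominance, and numbering them from 0. The sketch below is one way to do this; the function names are assumptions for illustration.

```python
def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    largest = n if largest is None else largest
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def exact_agreement_scale(n):
    """Map each zero-padded cluster vector to its symmetry error value."""
    vectors = [p + (0,) * (n - len(p)) for p in partitions(n)]
    # Descending tuple order is the dominance ranking: most symmetric first.
    vectors.sort(reverse=True)
    return {vector: error for error, vector in enumerate(vectors)}

def symmetry_distance(scale, xs, ys):
    """Distance between two cluster vectors on the error scale."""
    return abs(scale[xs] - scale[ys])

scale = exact_agreement_scale(4)
for vector, error in scale.items():
    print(vector, error)
print(symmetry_distance(scale, (3, 1, 0, 0), (2, 1, 1, 0)))   # 2
```

For n = 4 this yields the five cluster vectors of Table 5 with error values 0 through 4.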


4.2.3. Correctness, Detectability and Symmetry 

The total distance between two multi-user service items over the combined dimensions of correctness, 
detectability and symmetry must take into account any possible dependence between the dimensions. Our 
definition of the correctness-and-detectability (CD) distance already accounts for the dependence between 
correctness and detectability. In defining the contribution of the symmetry dimension, we would like our 
distance function to ignore differences already accounted for in the definition of the CD distance. In 
particular, note that a D item always disagrees with C and U items. In addition, for exact agreement, a C 
item always disagrees with a U item. Thus, these “built-in” disagreements should not contribute to the 
symmetry component of the total correctness-detectability-and-symmetry (CDS) distance. We consider
approximate and exact agreement separately. 

For approximate agreement, it is possible for a C to agree with a U. This relation must be accounted
for in the CDS distance. We use $X_A$ to denote the acceptable single-user items (i.e., items that pass the
input acceptance check at the users) of a multi-user service item X. $X_A$ can include C and U items. Using
our compact CD notation for a vector, X can be expressed as follows.

$X_{CD} = C^b D^p U^w = D^p X_A$ (24)

where:

$X_A = C^b U^w$ (25)

We use the cluster-size notation to represent the agreement pattern in $X_A$, which we denote $X_{A,S}$.

$X_{A,S} = (\alpha_{A,1}, \ldots, \alpha_{A,b+w})$ (26)

The total CDS distance between multi-user service items X and Y is given by:

$d_{CDS}(X, Y) = d_C(X, Y) + d_D(X, Y) + d_S(X, Y)$ (27)

with:

$d_S(X, Y) = d_S(X_{A,S}, Y_{A,S}) = |e_S(X_{A,S}) - e_S(Y_{A,S})|$ (28)

where $e_S(X_{A,S})$ and $e_S(Y_{A,S})$ are computed on their respective symmetry error scales.

The CDS error in item X is given by:

$e_{CDS}(X) = d_{CDS}(X, C^n) = d_C(X, C^n) + d_D(X, C^n) + d_S(X, C^n)$

$e_{CDS}(X) = e_C(X) + e_D(X) + e_S(X_{A,S}) = e_{CD}(X) + e_S(X_{A,S})$ (29)

For exact agreement, C and U items never agree. Therefore, the symmetry distance contribution is
only with respect to the U items in the multi-user service item. Let $X_U$ denote the undetectable incorrect
items in X. For exact agreement, X can be expressed as:

$X_{CD} = C^b D^p U^w = C^b D^p X_U$ (30)

with:

$X_U = U^w$ (31)

Using cluster-size notation, the exact agreement pattern in $X_U$, denoted $X_{U,S}$, is expressed as:

$X_{U,S} = (\alpha_{U,1}, \ldots, \alpha_{U,w})$ (32)

Equations (27) to (29) also apply to exact agreement with $X_{A,S}$ replaced by $X_{U,S}$. Table 6 shows the
total CDS error for patterns of OTH faults assuming exact agreement and n = 4.


Table 6: Total CDS error with exact agreement for patterns of OTH faults (n = 4)

OTH Category                              $X_{CD}$     $X_{U,S}$       $e_C$   $e_D$   $e_S(X_{U,S})$   $e_{CDS}$
Correct Symmetric (CS)                    $C^4$        (0, 0, 0, 0)    0       0       0                0
Strictly Omissive Asymmetric (SOA)        $C^3 D$      (0, 0, 0, 0)    1       0       0                1
                                          $C^2 D^2$    (0, 0, 0, 0)    2       0       0                2
                                          $C D^3$      (0, 0, 0, 0)    3       0       0                3
Omissive Symmetric (OS)                   $D^4$        (0, 0, 0, 0)    4       0       0                4
Single-Data Omissive Asymmetric (SDOA)    $D^3 U$      (1, 0, 0, 0)    4       1       0                5
                                          $D^2 U^2$    (2, 0, 0, 0)    4       2       0                6
                                          $D U^3$      (3, 0, 0, 0)    4       3       0                7
Transmissive Symmetric (TS)               $U^4$        (4, 0, 0, 0)    4       4       0                8
Transmissive Asymmetric (TA)              $U^4$        (3, 1, 0, 0)    4       4       1                9
                                          $U^4$        (2, 2, 0, 0)    4       4       2                10
                                          $U^4$        (2, 1, 1, 0)    4       4       3                11
                                          $U^4$        (1, 1, 1, 1)    4       4       4                12
                                          $C^3 U$      (1, 0, 0, 0)    1       1       0                2
                                          $C^2 U^2$    (2, 0, 0, 0)    2       2       0                4
                                          $C^2 U^2$    (1, 1, 0, 0)    2       2       1                5
                                          $C D^2 U$    (1, 0, 0, 0)    3       1       0                4
                                          $C D U^2$    (1, 1, 0, 0)    3       2       1                6
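The entries of Table 6 can be recomputed with a short Python sketch. It rests on assumptions inferred from the tables rather than stated as code in this report: each D or U item contributes 1 to the correctness error, each U item contributes 1 to the detectability error, and the symmetry error of the U items is the rank of their cluster pattern among the integer partitions of w (reverse lexicographic order, which matches Table 5 for w = 4). The function names are hypothetical.

```python
def partitions(n, largest=None):
    """Yield the integer partitions of n as tuples, in reverse lexicographic order."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def symmetry_error(clusters):
    """Rank of a cluster pattern among the partitions of w = sum(clusters)."""
    parts = tuple(sorted((c for c in clusters if c > 0), reverse=True))
    return list(partitions(sum(parts))).index(parts)


def cds_error(b, p, w, u_clusters):
    """Return (e_C, e_D, e_S, e_CDS) for an item C^b D^p U^w under exact agreement."""
    if sum(u_clusters) != w:
        raise ValueError("cluster sizes of the U items must add up to w")
    e_c = p + w                      # assumed: every D and U item is incorrect
    e_d = w                          # assumed: only U items are undetectable
    e_s = symmetry_error(u_clusters)
    return e_c, e_d, e_s, e_c + e_d + e_s


print(cds_error(0, 3, 1, (1,)))          # D^3 U
print(cds_error(1, 1, 2, (1, 1)))        # C D U^2 with disagreeing U items
print(cds_error(0, 0, 4, (1, 1, 1, 1)))  # U^4, all items different
```

Trailing zeros in a cluster vector can be omitted or included in `u_clusters`; they are ignored when ranking the pattern.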


4.3. Service Outage 

As stated in Section 2.2.1, a service outage can be modeled as an error burst. We use k to denote the
number of service items in an outage. Note that k, in effect, is a measure of the persistence (i.e.,
duration) of the outage. $k_{good}$ and $k_{bad}$ denote the number of correct and incorrect items in the error
burst, respectively, such that:

$k = k_{good} + k_{bad}$ (33)

The error in a service item can be measured with respect to correctness, detectability or symmetry, or
combinations of these. The choice of error metric depends on the purpose and specifics of the analysis
being performed. Here we use $e_i$ to denote the error in the i-th item of the error burst, with
$1 \le i \le k$. The severity of corruption in an outage, denoted s, is the amount of error in the outage,
which we define as the sum of the error in the service items.

$s = \sum_i e_i$ (34)

Let $\delta$ be an index for the error levels (i.e., magnitude of the error) in the chosen item error scale,
such that $0 \le \delta \le \delta_{max}$, where $\delta_{max}$ denotes the largest value on the item error scale. $k_\delta$ denotes the
number of items in the error burst that have an error of $\delta$. Then, the corruption severity can be
expressed as follows.

$s = \sum_\delta \delta \cdot k_\delta$ (35)

The number of service items in the outage can be expressed as:

$k = k_0 + \ldots + k_{\delta_{max}}$ (36)

Also:

$k_{good} = k_0$ (37)

$k_{bad} = k_1 + \ldots + k_{\delta_{max}}$ (38)

Alternatively, the corruption severity can be expressed in terms of the occurrence rates of the error
levels. To do this, let $\phi_\delta$ denote the rate of item error level $\delta$, such that:

$\phi_\delta = k_\delta / k$ (39)

Then:

$s = k \cdot (\sum_\delta \delta \cdot \phi_\delta)$ (40)

The summation in (40) is the mean value of service item error in the outage, denoted $e_{mean}$:

$e_{mean} = \sum_\delta \delta \cdot \phi_\delta$ (41)

So:

$s = k \cdot e_{mean}$ (42)
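The relations among equations (34) to (42) can be checked numerically. The following Python sketch is illustrative only; the function name `outage_severity` and the sample error burst are hypothetical, and the item errors are assumed to be non-negative integers taken from whichever item error scale is chosen.

```python
from collections import Counter
import math


def outage_severity(item_errors):
    """Summarize an error burst given the error of each service item in it."""
    k = len(item_errors)
    if k == 0:
        raise ValueError("an outage must contain at least one service item")
    counts = Counter(item_errors)                        # k_delta per error level
    s = sum(item_errors)                                 # equation (34)
    s_by_level = sum(d * n for d, n in counts.items())   # equation (35)
    rates = {d: n / k for d, n in counts.items()}        # equation (39)
    e_mean = sum(d * r for d, r in rates.items())        # equation (41)
    return {
        "k": k,
        "k_good": counts.get(0, 0),                      # equation (37)
        "k_bad": k - counts.get(0, 0),                   # equation (38)
        "s": s,
        "s_by_level": s_by_level,
        "e_mean": e_mean,
    }


burst = [0, 2, 2, 1, 0, 2]  # hypothetical item errors of six consecutive items
summary = outage_severity(burst)
assert summary["s"] == summary["s_by_level"]
assert math.isclose(summary["s"], summary["k"] * summary["e_mean"])  # equation (42)
print(summary)
```

Both ways of computing the severity, item by item and by error level, give the same value, and the product of k and the mean item error recovers it as well.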

The operation of the system may be periodic (or cyclic) such that the pattern of service repeats every
certain number of items. In that case, we use m to denote the number of expected service items per cycle.
The duration of an error burst can be expressed in terms of the number of cycles, denoted A, such that:

$k = m \cdot A$ (43)

By normalizing the corruption severity with respect to the number of items per cycle, we can directly
compare outage severity for periodic systems performing the same function, with the same user interface
and with the same real-time cycle duration, but operating at different data rates such that they deliver
a different number of items per cycle. Let $s^*$ denote the data-rate-normalized corruption severity.

$s^* = s / m = A \cdot (\sum_\delta \delta \cdot \phi_\delta) = A \cdot e_{mean}$ (44)

Finally, to compute the normalized corruptibility and resilience, we need to define a maximum value
for the corruption severity, denoted $s_{max}$. Let $k_{max}$ denote the maximum possible (or expected) number of
items in an outage. Then:

$s_{max} = k_{max} \cdot e_{max}$ (45)

For a cyclic system, the maximum data-rate-normalized corruption severity $s^*_{max}$ is given by:

$s^*_{max} = A_{max} \cdot e_{max}$ (46)

where $A_{max}$ denotes the maximum number of cycles in an outage.
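As an illustration of the normalization in equation (44), the Python sketch below compares two hypothetical periodic systems that suffer an outage of the same number of cycles with the same mean item error, but deliver a different number of items per cycle. All names and numeric values are assumptions chosen for the example.

```python
def normalized_severity(item_errors, m):
    """Data-rate-normalized corruption severity s* = s / m, as in equation (44)."""
    if m <= 0:
        raise ValueError("m must be a positive number of items per cycle")
    return sum(item_errors) / m


def max_normalized_severity(a_max, e_max):
    """Upper bound s*_max = A_max * e_max, as in equation (46)."""
    return a_max * e_max


# Hypothetical outage lasting A = 3 cycles on two systems with different data rates.
slow = [2, 1] * 3         # m = 2 items per cycle, k = 6 items
fast = [2, 1, 2, 1] * 3   # m = 4 items per cycle, k = 12 items

s_star_slow = normalized_severity(slow, m=2)
s_star_fast = normalized_severity(fast, m=4)
s_star_max = max_normalized_severity(a_max=5, e_max=2)

print(sum(slow), sum(fast))              # raw severities differ
print(s_star_slow, s_star_fast)          # normalized severities agree
print(s_star_slow <= s_star_max)
```

The raw severities differ by the ratio of the data rates, while the normalized values coincide, which is the comparison the normalization is meant to allow.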
5. System Perturbation Metrics


A perturbation is an outage of the services provided by one or more internal system components due to 
(transient or permanent) defects in them. These defects mean that, in effect, the defective components are 
not performing their specified functions and thus act as sources of errors in the system. During a 
perturbation, it is possible for other components to fail to deliver proper services because of effects 
propagated out from the defective components. The severity of a perturbation is determined by the 
service corruption only at the defective components, and non-defective components are treated as if they 
were performing correctly for the purpose of computing the corruption in a perturbation. Note that it is 
perfectly legitimate for all system components to become defective during a perturbation. However, 
depending on the purpose of the analysis being performed, there may be an upper bound constraint on the 
number of simultaneously defective components. 

All the concepts developed to measure disruption corruption directly apply to an outage at an internal
component. Let $s_j$ denote the severity of corruption at the j-th component, with $1 \le j \le y$, where y
denotes the number of components in the system. $s_j$ is determined as described previously for a
disruption. Note that $s_j = 0$ for an unperturbed component. Then, the total perturbation severity s is
given by:

$s = \sum_j s_j$ (47)

To determine the maximum perturbation severity, we simply need to maximize the values of the
component corruption severities.

$s_{max} = \sum_j s_{j,max}$ (48)
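A minimal Python sketch of equations (47) and (48) follows. The component names, the error bursts and the per-component limits are hypothetical, and the sketch assumes that each component's maximum severity is the product of its maximum burst length and the largest item error, by analogy with equation (45).

```python
def perturbation_severity(component_errors):
    """Return the total severity s and the per-component severities s_j."""
    severities = {name: sum(errors) for name, errors in component_errors.items()}
    return sum(severities.values()), severities


def max_perturbation_severity(limits):
    """Sum of s_j,max, assuming s_j,max = k_max * e_max for each component."""
    return sum(k_max * e_max for k_max, e_max in limits.values())


# Hypothetical perturbation: two defective components and one unperturbed component.
errors = {"hub_1": [2, 2, 1], "node_2": [], "node_3": [1, 0, 1]}
limits = {name: (4, 2) for name in errors}   # (k_max, e_max) per component

s, per_component = perturbation_severity(errors)
s_max = max_perturbation_severity(limits)
print(per_component, s, s_max)
```

The unperturbed component contributes zero to the total, as noted above for equation (47).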


6. Final Remarks 

In this report, we have proposed a new approach for the assessment of system upset resilience. This 
approach is based on an analysis of desirable, top-level system attributes interpreted in terms of a generic 
system service model. Special emphasis is given to the attributes for safety-critical, real-time systems, 
including distributed systems. Combining the service model with a structural system description model, 
we develop a stimulus-response concept linking the events internal to the system to the behavior 
observable at the external system interface. We propose the use of fault injection experiments to 
stimulate the system and generate a set of corresponding responses. We have proposed a quantitative 
definition of resilience based on the statistical characterization of this response space. To enable this, we 
defined a set of service error metrics derived from insight into the classification criteria implicit in the 
OTH fault model. Throughout, we have tried to develop a general approach that leverages existing 
system concepts and applies novel and mathematically sound error metrics that are also meaningful in the 
analysis and design of systems. 

This approach will be applied in the analysis of observed HIRF effects in the HEC experiment. To 
characterize the relation between field disturbances and internal perturbations, we will use the relation 
described in [87] between the strength of the radiated field and the severity of error bursts in a system, as 
well as analyses of state data collected during the experiment to identify low-level physical components 
directly affected by the disturbances. Quantitative analyses of the response space using the metrics
described in this report will enable us to gain insight into the effectiveness of the fault handling
mechanisms of the system. In particular, we intend to identify error containment and recovery 
weaknesses in the network and to propose ways to strengthen the design. 

We will also apply the resilience assessment approach in a simulated fault-injection experiment using 
error-propagation system models. The stimulus will be perturbations with a severity distribution based on 
the disturbance space characterized in [87] . The models will be validated through review and comparison 
with the results from the HEC experiment. The purpose of this experiment will be to achieve a thorough 
understanding of the relation between internal upsets and the propagated effects observable at the external 
system interface. 

We expect that successful application of the proposed upset resilience assessment approach to the 
analysis of the HEC physical-fault injection experiment and to the simulated-fault injection experiment 
will affirm the practicality of the approach. 

With the insight gained from the analysis of the fault injection experiments, we will then tackle the 
problem of analysis of self-stabilization-based resilience in systems with required safety and real-time 
attributes. The first goal in this direction is the design of an advanced version of the ROBUS-2 system 
[93] with provable self-stabilization properties while retaining its foundation on the unified
fault-tolerance theory described by Miner et al. [61], including the dynamic fault assumptions. We expect that
attainment of this goal, especially proving self-stabilization properties, will demand a fundamental 
breakthrough in the analysis of distributed clock synchronization and membership protocols in the 
presence of U-type faults. 





Appendix A. Proofs for Metrics of Service Item Error 


In this appendix it is shown that the service-item error metrics defined in Section 4 satisfy the required 
mathematical properties for a metric. The properties, introduced in Section 2.5, are restated here for 
convenience. 










Non-negativity:                $d(x, y) \ge 0$
Symmetry:                      $d(x, y) = d(y, x)$
Identity of Indiscernibles:    $d(x, y) = 0$ if and only if $x = y$
Triangle Inequality:           $d(x, z) \le d(x, y) + d(y, z)$


A.1. Error metrics for a single-user service item

We consider separately the distance metrics for the dimensions of correctness and detectability,
followed by the case of correctness-and-detectability distance.

A.1.1. Correctness

Recall from Section 4.1 that with respect to correctness, a simplex item x can be either correct or
incorrect, and it can be represented by the Boolean variable $x_C$, which is TRUE for a correct item. Table
A.1 shows that the following relation holds.

$d_C(x, y) = |e_C(x) - e_C(y)|$ (A.1)


Table A.1: Correctness error and distance for a simplex item

$x_C$   $y_C$   $e_C(x)$   $e_C(y)$   $|e_C(x) - e_C(y)|$   $d_C(x, y)$
T       T       0          0          0                     0
T       F       0          1          1                     1
F       T       1          0          1                     1
F       F       1          1          0                     0


We use equation (A.1) and Table A.1 to show that $d_C(x, y)$ satisfies the properties of a metric.

• Non-negativity: This is a property of the absolute-value function and applies to $d_C(x, y)$ from equality
(A.1). This is also shown by inspection of the last column of Table A.1.

• Symmetry: This is shown by inspection of the first, second and last columns in Table A.1. Also,
using (A.1):

$d_C(x, y) = |e_C(x) - e_C(y)| = |e_C(y) - e_C(x)| = d_C(y, x)$

• Identity of Indiscernibles: This is shown by comparing the last column in Table A.1 with the first and
second columns combined.

• Triangle Inequality: It is well known that the Triangle Inequality applies to the absolute-value
function, such that $|a + b| \le |a| + |b|$ for arbitrary real-valued a and b. If we substitute
$e_C(x) - e_C(y)$ for a and $e_C(y) - e_C(z)$ for b, we get:

$|[e_C(x) - e_C(y)] + [e_C(y) - e_C(z)]| \le |e_C(x) - e_C(y)| + |e_C(y) - e_C(z)|$

Using equation (A.1):

$d_C(x, z) \le d_C(x, y) + d_C(y, z)$

A.1.2. Detectability 

The proofs that equation (A.2) is valid and that $d_D(x, y)$ satisfies the properties of a metric are
identical to the proofs for correctness.

$d_D(x, y) = |e_D(x) - e_D(y)|$ (A.2)

A.1.3. Correctness and Detectability

In Section 4.1, $d_{CD}(x, y)$ was defined as:

$d_{CD}(x, y) = d_C(x, y) + d_D(x, y)$

Using (A.1) and (A.2), we get:

$d_{CD}(x, y) = |e_C(x) - e_C(y)| + |e_D(x) - e_D(y)|$ (A.3)

Table A.2 is based on the content of Tables 2 and 3 in Section 4.1. Table A.2 shows that the following
relation is valid.

$d_{CD}(x, y) = |e_{CD}(x) - e_{CD}(y)|$ (A.4)

Table A.2: Correctness-and-detectability error and distance for a simplex item

x    y    $e_{CD}(x)$   $e_{CD}(y)$   $|e_{CD}(x) - e_{CD}(y)|$   $d_{CD}(x, y)$
C    C    0             0             0                           0
C    D    0             1             1                           1
C    U    0             2             2                           2
D    D    1             1             0                           0
D    U    1             2             1                           1
U    U    2             2             0                           0


From basic algebra, we know that for natural numbers a and b, the distance between them is given by
$d(a, b) = |a - b|$. From Table 3 in Section 4.1, note that the CD error scale is a simple linear scale over
the natural numbers from 0 to 2 where the values correspond to the CD errors of C, D, and U. (A.4) is valid
simply by definition of the distance between points on that scale.
Equation (A.4) and Table A.2 can be used to show that $d_{CD}(x, y)$ satisfies the properties of a metric.
The reasoning is similar to the one used in Section A.1.1 for the correctness distance of a simplex item.

Note that, by equations (A.3) and (A.4), the following relation is valid.

$|e_C(x) - e_C(y)| + |e_D(x) - e_D(y)| = |[e_C(x) + e_D(x)] - [e_C(y) + e_D(y)]|$ (A.5)
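Because a simplex item takes only three values, the metric properties of the CD distance and relation (A.5) can also be confirmed by exhaustive enumeration. The Python sketch below is not part of the proofs; it assumes per-item error values inferred from Table A.2 and Table 6 (a D or U item has correctness error 1, and only a U item has detectability error 1), and the helper names are hypothetical.

```python
from itertools import product

# Assumed per-item error values, inferred from Table A.2 and Table 6.
E_C = {"C": 0, "D": 1, "U": 1}   # correctness error
E_D = {"C": 0, "D": 0, "U": 1}   # detectability error


def d_cd(x, y):
    """CD distance as the sum of the two one-dimensional distances, equation (A.3)."""
    return abs(E_C[x] - E_C[y]) + abs(E_D[x] - E_D[y])


def e_cd(x):
    """CD error of a simplex item: 0 for C, 1 for D, 2 for U."""
    return E_C[x] + E_D[x]


def metric_violations(d, points):
    """List every violation of the four metric properties over a finite set."""
    violations = []
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            violations.append(("non-negativity", x, y))
        if d(x, y) != d(y, x):
            violations.append(("symmetry", x, y))
        if (d(x, y) == 0) != (x == y):
            violations.append(("identity of indiscernibles", x, y))
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):
            violations.append(("triangle inequality", x, y, z))
    return violations


ITEMS = "CDU"
print(metric_violations(d_cd, ITEMS))
print(all(d_cd(x, y) == abs(e_cd(x) - e_cd(y)) for x, y in product(ITEMS, repeat=2)))
```

An empty list of violations, together with the second check, agrees with Table A.2 and relation (A.5) under the assumed error values.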


A.2. Error metrics for a multiple-user service item 

We now show that the metrics for multi-user service items satisfy the properties of a mathematical
metric. Recall that a multi-user service item X is a vector of simplex service items denoted by $x_i$ for
$1 \le i \le n$.

$X = (x_1, x_2, \ldots, x_n)$

A.2.1. Correctness 

From Section 4.2.1:

$d_C(X, Y) = d_C(x_1, y_1) + \ldots + d_C(x_n, y_n)$ (A.6)

Using equation (A.1), equation (A.6) can be expressed as:

$d_C(X, Y) = |e_C(x_1) - e_C(y_1)| + \ldots + |e_C(x_n) - e_C(y_n)|$ (A.7)

We leverage equations (A.6) and (A.7) to show that $d_C(X, Y)$ satisfies the properties of a metric.

• Non-negativity: This is a property of the absolute-value function. It applies to all $d_C(x_i, y_i)$ and it
is preserved by addition in (A.6).

• Symmetry: Using (A.7):

$d_C(X, Y) = |e_C(x_1) - e_C(y_1)| + \ldots + |e_C(x_n) - e_C(y_n)| = |e_C(y_1) - e_C(x_1)| + \ldots + |e_C(y_n) - e_C(x_n)| = d_C(Y, X)$

• Identity of Indiscernibles: From Section A.1.1, we know that Identity of Indiscernibles applies to
each of the summands in equation (A.6). Because $d_C(x_i, y_i) \ge 0$, $d_C(X, Y) = 0$ requires $d_C(x_i, y_i) = 0$
for all i. Therefore, with respect to correctness, $x_i = y_i$, which means that X = Y with respect to
correctness. Reasoning in the opposite direction completes the proof.

• Triangle Inequality: From (A.6):

$d_C(X, Z) = d_C(x_1, z_1) + \ldots + d_C(x_n, z_n)$

Using the proof in Section A.1.1:

$d_C(x_i, z_i) \le d_C(x_i, y_i) + d_C(y_i, z_i)$

Thus:

$d_C(X, Z) \le [d_C(x_1, y_1) + \ldots + d_C(x_n, y_n)] + [d_C(y_1, z_1) + \ldots + d_C(y_n, z_n)]$

$d_C(X, Z) \le d_C(X, Y) + d_C(Y, Z)$

A.2.2. Detectability 

The proof that $d_D(X, Y)$ is a metric is essentially the same as in Section A.2.1 with $d_C(X, Y)$ replaced
by $d_D(X, Y)$.

A.2.3. Correctness and Detectability 

From equation (15) in Section 4.2.1:

$d_{CD}(X, Y) = d_{CD}(x_1, y_1) + \ldots + d_{CD}(x_n, y_n)$ (A.8)

Using equation (5) from Section 4.1:

$d_{CD}(X, Y) = [d_C(x_1, y_1) + d_D(x_1, y_1)] + \ldots + [d_C(x_n, y_n) + d_D(x_n, y_n)]$

$d_{CD}(X, Y) = [d_C(x_1, y_1) + \ldots + d_C(x_n, y_n)] + [d_D(x_1, y_1) + \ldots + d_D(x_n, y_n)]$

Using equations (13) and (14) from Section 4.2.1:

$d_{CD}(X, Y) = d_C(X, Y) + d_D(X, Y)$ (A.9)

Equation (A.9) can be leveraged to show that $d_{CD}(X, Y)$ satisfies the properties of a metric.

• Non-negativity: From Sections A.2.1 and A.2.2, $d_C(X, Y)$ and $d_D(X, Y)$ individually satisfy this
property. Their sum in (A.9) preserves this property.

• Symmetry: From Sections A.2.1 and A.2.2, we know that $d_C(X, Y)$ and $d_D(X, Y)$ individually satisfy
this property. Then:

$d_{CD}(X, Y) = d_C(X, Y) + d_D(X, Y) = d_C(Y, X) + d_D(Y, X) = d_{CD}(Y, X)$

• Identity of Indiscernibles: From Sections A.2.1 and A.2.2, we know that $d_C(X, Y)$ and $d_D(X, Y)$
individually satisfy this property. Because both terms are non-negative, $d_{CD}(X, Y) = 0$ requires
$d_C(X, Y) = 0$ and $d_D(X, Y) = 0$, which in turn require X = Y with respect to correctness and
detectability. Reasoning in the opposite direction, it can be shown that X = Y with respect to
correctness and detectability implies $d_{CD}(X, Y) = 0$.

• Triangle Inequality: From Sections A.2.1 and A.2.2, we know that $d_C(X, Y)$ and $d_D(X, Y)$
individually satisfy this property. Thus:

$d_{CD}(X, Z) = d_C(X, Z) + d_D(X, Z) \le [d_C(X, Y) + d_C(Y, Z)] + [d_D(X, Y) + d_D(Y, Z)]$

$d_{CD}(X, Z) \le d_{CD}(X, Y) + d_{CD}(Y, Z)$
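For a small number of users, the result for the multi-user CD distance can be cross-checked by brute force. The Python sketch below assumes the CD error scale of Table A.2 (C = 0, D = 1, U = 2) and n = 3 users; the function names are hypothetical and the enumeration is a sanity check, not a substitute for the proof.

```python
from itertools import product

E_CD = {"C": 0, "D": 1, "U": 2}   # CD error scale from Table A.2


def d_cd_item(x, y):
    """CD distance between two simplex items, equation (A.4)."""
    return abs(E_CD[x] - E_CD[y])


def d_cd_vector(X, Y):
    """CD distance between two multi-user items, summed item by item as in (A.8)."""
    if len(X) != len(Y):
        raise ValueError("both items must have the same number of users")
    return sum(d_cd_item(x, y) for x, y in zip(X, Y))


n = 3
vectors = list(product("CDU", repeat=n))   # all 27 multi-user items for n = 3

triangle_ok = all(
    d_cd_vector(X, Z) <= d_cd_vector(X, Y) + d_cd_vector(Y, Z)
    for X, Y, Z in product(vectors, repeat=3)
)
identity_ok = all(
    (d_cd_vector(X, Y) == 0) == (X == Y)
    for X, Y in product(vectors, repeat=2)
)
print(triangle_ok, identity_ok)
print(d_cd_vector(("C", "C", "D"), ("U", "D", "D")))
```

The same enumeration can be repeated for other small values of n; its cost grows as 27 to the power n for the triangle inequality check, so it is practical only for a handful of users.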
A.2.4. Symmetry

Recall from Section 4.2.2 that for cluster-based symmetry, a multi-user service item X can be
represented by a vector $X_S$ of n cluster sizes:

$X_S = (\alpha_1, \ldots, \alpha_n)$,

where $\alpha_i$ denotes the size of the i-th cluster and the $\alpha_i$ elements are sorted by decreasing value such
that $\alpha_1 \ge \alpha_2 \ge \ldots \ge \alpha_n$. From Tables 4 and 5 in Section 4.2.2, note that the cluster-based symmetry scale
is a simple linear scale over the natural numbers from 0 up to some maximum value, where the values
correspond to the symmetry errors of the cluster patterns. (A.10) is valid simply by definition of the
distance between points on that scale.

$d_S(X, Y) = |e_S(X) - e_S(Y)|$ (A.10)

Equation (A.10) can be used to show that $d_S(X, Y)$ satisfies the properties of a metric.

• Non-negativity: This is a property of the absolute-value function and applies to $d_S(X, Y)$ from equality
(A.10).

• Symmetry: Using (A.10):

$d_S(X, Y) = |e_S(X) - e_S(Y)| = |e_S(Y) - e_S(X)| = d_S(Y, X)$

• Identity of Indiscernibles: From (A.10), $d_S(X, Y) = 0$ implies that $e_S(X) = e_S(Y)$. On our simple linear
symmetry error scale, this means that $X_S = Y_S$.

• Triangle Inequality: It is well known that the Triangle Inequality applies to the absolute-value
function, such that $|a + b| \le |a| + |b|$ for arbitrary real-valued a and b. Thus, if we substitute
$e_S(X) - e_S(Y)$ for a and $e_S(Y) - e_S(Z)$ for b, we get:

$|[e_S(X) - e_S(Y)] + [e_S(Y) - e_S(Z)]| \le |e_S(X) - e_S(Y)| + |e_S(Y) - e_S(Z)|$

Using equation (A.10):

$d_S(X, Z) \le d_S(X, Y) + d_S(Y, Z)$

A.2.5. Correctness, Detectability and Symmetry 

The total CDS distance between multi-user service items X and Y is given by:

$d_{CDS}(X, Y) = d_C(X, Y) + d_D(X, Y) + d_S(X, Y)$ (A.11)

with:

$d_S(X, Y) = d_S(X_{A,S}, Y_{A,S}) = |e_S(X_{A,S}) - e_S(Y_{A,S})|$ (A.12)

for approximate agreement, or for exact agreement:

$d_S(X, Y) = d_S(X_{U,S}, Y_{U,S}) = |e_S(X_{U,S}) - e_S(Y_{U,S})|$ (A.13)

where $e_S(X_{A,S})$, $e_S(Y_{A,S})$, $e_S(X_{U,S})$ and $e_S(Y_{U,S})$ are computed on their respective symmetry error scales.
Using equation (A.9), equation (A.11) can be expressed as:

$d_{CDS}(X, Y) = d_{CD}(X, Y) + d_S(X, Y)$ (A.14)

We know that $d_{CD}(X, Y)$ and $d_S(X, Y)$ satisfy the mathematical properties of a metric. We use this fact
to show that $d_{CDS}(X, Y)$ also satisfies the properties of a metric.

• Non-negativity: $d_{CD}(X, Y) \ge 0$ and $d_S(X, Y) \ge 0$ imply that $d_{CDS}(X, Y) \ge 0$.

• Symmetry: $d_{CDS}(X, Y) = d_{CD}(X, Y) + d_S(X, Y) = d_{CD}(Y, X) + d_S(Y, X) = d_{CDS}(Y, X)$

• Identity of Indiscernibles: $d_{CDS}(X, Y) = 0$ implies that $d_{CD}(X, Y) = 0$ and $d_S(X, Y) = 0$.
$d_{CD}(X, Y) = 0$ implies that X = Y with respect to correctness and detectability. Using the notation in
Section 4.2.1, $X_{CD} = Y_{CD}$. For approximate agreement, this implies that $X_A$ and $Y_A$ have the same number
of elements: b + w (see equation (25) in Section 4.2.3). For exact agreement, $X_U$ and $Y_U$ have the same
number of elements: w (see equation (31) in Section 4.2.3). This, combined with $d_S(X, Y) = 0$,
implies that $X_{A,S} = Y_{A,S}$ for approximate agreement and $X_{U,S} = Y_{U,S}$ for exact agreement. Thus, X = Y
with respect to correctness, detectability and symmetry. In the other direction, X = Y with respect to
correctness, detectability and symmetry implies $d_{CD}(X, Y) = 0$ and $d_S(X, Y) = 0$, which implies
$d_{CDS}(X, Y) = 0$.

• Triangle Inequality: We know that:

$d_{CD}(X, Z) \le d_{CD}(X, Y) + d_{CD}(Y, Z)$

and

$d_S(X, Z) \le d_S(X, Y) + d_S(Y, Z)$

Adding these two inequalities:

$d_{CD}(X, Z) + d_S(X, Z) \le [d_{CD}(X, Y) + d_S(X, Y)] + [d_{CD}(Y, Z) + d_S(Y, Z)]$

Therefore:

$d_{CDS}(X, Z) \le d_{CDS}(X, Y) + d_{CDS}(Y, Z)$
          Alternating Current
          American Standard Code for Information Interchange
          Application Specific Integrated Circuit
          Bus Interface Unit
          Controller Coordination Protocol
          Commercial Off-The-Shelf
          Central Processing Unit
CS        Correct Symmetric
          Continuous Wave
          Direct Current
          Derivation Systems, Inc.
          Error Containment Region
          Electromagnetic
          Electromagnetic Interference
          Fault Containment Region
          Fine HIRF Susceptibility Threshold Characterization
          Field Programmable Gate Array
          Finite State Machine
          Hardware Configuration
HEC       HIRF Effects Characterization
          Hub Fault Analyzer
HIRF      High Intensity Radiated Field
          HIRF Susceptibility Threshold Characterization
          International Electrotechnical Commission
          Integrated Modular Architecture
          Integrated Vehicle Health Management
          Lowest Usable Frequency
          Mega-bits per second
          Node Fault Analyzer
          National Institute of Standards and Technology
OS        Omissive Symmetric
OTH       Omissive-Transmissive Hybrid (fault model)
          Probability Distribution
          Processing Element
          Reverberation Chamber
          Radio Frequency
          Root Mean Square
          Redundancy Management Unit
ROBUS     Robust Bus; also Reliable Optical Bus
          ROBUS Protocol Processor
          Reconfigurable SPIDER Prototyping Platform
          Radio Technical Commission for Aeronautics
          Small Business Innovation Research
          Standard Deviation
SDOA      Single Data Omissive Asymmetric
SHM       System Health Monitor
SIM       Stirrer Induced Modulation
SOA       Strictly Omissive Asymmetric
SPIDER    Scalable Processor-Independent Design for Extended Reliability
SUT       System Under Test
TA        Transmissive Asymmetric
TCS       Test Control System
          Time Division Multiple Access
TS        Transmissive Symmetric
V&V       Validation and Verification
VHDL      VHSIC Hardware Description Language
VHSIC     Very High Speed Integrated Circuit